

News & Views

Posted by Barbie
on 29th September 2011

The last couple of weeks have been distracting for reasons that have nothing to do with CPAN Testers. Sadly Facebook and Google decided to enforce their Naming Policies on my accounts, as apparently people in the real world can't possibly exist with mononyms! In the discussions that followed online, Peter Edwards sent me a link to a rather interesting article. Prior to the CPAN Testers server problems, I had started to add the Facebook Like and Google +1 buttons to pages across the CPAN Testers sites. Having read this article I have now decided to remove them. Although the initial idea behind these buttons seemed to be quite a nice promotional tool, the subversive way they are being used is an invasion of your privacy that I don't wish to be a part of. Just to be clear, this decision has nothing to do with my issues with the respective Naming Policies, but everything to do with the invasion of privacy. I recommend you read the article for further information.

Having struggled with the server over the last few weeks, with the database build and reindexing taking weeks rather than days, in my last update I asked for help. Ioan Rogers stepped forward and did some analysis of our set-up. He helped to pinpoint some system apps we didn't need to have running, but in general the set-up did seem fine. However, the disk IO was still a problem. Using 'htop' and 'iotop' we could see that the RAID management software was at 99% or more most of the time, even when there wasn't much being written to disk. We then installed 'atop', which highlighted the issue much more clearly: one of the disks was reported as over 100% busy handling IO, while the other was fluctuating around 4%.
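
For anyone curious what that "busy" figure actually measures, here is a rough sketch of the idea in Perl. It is not what we ran (atop and iotop do this properly); it just samples /proc/diskstats twice and reports what share of the interval each disk spent servicing IO.

    use strict;
    use warnings;

    # Read the cumulative "milliseconds spent doing IO" counter for each disk.
    sub io_ticks {
        my %ticks;
        open my $fh, '<', '/proc/diskstats' or die $!;
        while (<$fh>) {
            my @f = split ' ';
            next unless $f[2] && $f[2] =~ /^sd[a-z]$/;   # whole disks only (sda, sdb, ...)
            $ticks{ $f[2] } = $f[12];                    # 10th stat field: ms spent doing IO
        }
        return \%ticks;
    }

    my $interval = 5;                                    # seconds between samples
    my $before   = io_ticks();
    sleep $interval;
    my $after    = io_ticks();

    # A disk that spent the whole interval servicing IO shows as roughly 100% busy.
    for my $disk ( sort keys %$after ) {
        my $busy = ( $after->{$disk} - $before->{$disk} ) / ( $interval * 1000 ) * 100;
        printf "%s: %.1f%% busy\n", $disk, $busy;
    }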

Thinking the disk itself was the problem, I contacted the guys at Bytemark to ask whether they could find a fault with the physical disk. On investigation they identified a kernel bug that was blocking IO unnecessarily. A kernel upgrade and reboot successfully cured the problem. My thanks to Ioan Rogers for the initial help and advice, and a big thank you to the guys at Bytemark, James Lawrie & Chris Cottam.

Having got the server back on track, I then turned my attention to the performance fixes I had started to add to the feed parser. I am pleased to say that, once again thanks to Devel::NYTProf, the feeder code is now parsing 3 hours' worth of reports in roughly 20 minutes. As the restore point was a week prior to the disk crash, the feeder restarted from the reports posted on 20th August. At this rate I expect us to be fully up to date by next week. I'm already planning to have the Reports website back online, possibly over the weekend.
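
For anyone who hasn't tried Devel::NYTProf, the usual route is simply to run the script under 'perl -d:NYTProf' and then generate a report with 'nytprofhtml'. If you only want to profile one hot section, the module's run-time controls can be used instead; a minimal sketch (the parsing routine name is made up):

    use strict;
    use warnings;

    # NYTPROF options must be set before Devel::NYTProf is loaded;
    # start=no means profiling doesn't begin until we ask for it.
    BEGIN { $ENV{NYTPROF} = 'start=no' }
    use Devel::NYTProf;

    DB::enable_profile();      # start profiling just the section we care about
    parse_recent_reports();    # hypothetical feed-parsing routine
    DB::disable_profile();
    DB::finish_profile();      # writes ./nytprof.out; view it with nytprofhtml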

The Statistics and Development websites are back online, although there are still links and data that aren't up to date. The Preferences site is likely to take a little longer, as I now have to apply again for an SSL certificate, but as I'm not switching the emails just yet, this shouldn't be too much of a problem for the moment.

All being well we should be almost fully operational next week. Apologies it's taken so long to get everything back online, but rest assured no reports have been lost.

Posted by Barbie
on 20th September 2011

The MySQL databases have now been rebuilt and correctly synced with each other. The SQLite databases are now being updated, and these should be completed by the end of the week. This has enabled me to write some simple scripts to create and repair the databases, which I'll now be including in a separate distribution to be released on GitHub. It will also include all the Apache, MySQL, logrotate, cron and other config and script files, so that if we ever have to rebuild again, getting started will be a lot easier.
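
To give a flavour of what those scripts do, here is a minimal sketch using DBI. The credentials are placeholders and the table list is an assumption (cpanstats is the main reports table), but CHECK TABLE and REPAIR TABLE are standard MySQL statements.

    use strict;
    use warnings;
    use DBI;

    # Placeholder credentials - the real values live in the config files.
    my $dbh = DBI->connect( 'dbi:mysql:database=cpanstats', 'user', 'password',
        { RaiseError => 1 } );

    # Table list is an assumption, purely for illustration.
    for my $table (qw( cpanstats uploads )) {
        my $rows   = $dbh->selectall_arrayref( "CHECK TABLE $table", { Slice => {} } );
        my $status = $rows->[-1]{Msg_text} || '';
        next if $status eq 'OK';

        warn "$table: $status - attempting repair\n";
        $dbh->do("REPAIR TABLE $table");
    }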

The websites are now rebuilding too. The statistics site will take a while to complete, as it has to rerun all the statistical analysis. Trying to analyse everything all at once tends to grind the server to a halt, whereas analysing bite-size chunks, although slightly slower, uses less memory and saves progress to disk, so we don't have to start again from scratch if anything falls over.
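
The chunking itself is nothing clever. A minimal sketch of the idea, with made-up routine names, looks something like this:

    use strict;
    use warnings;

    my $checkpoint = 'lastid.txt';   # where progress is saved to disk
    my $chunk_size = 10_000;

    # Carry on from the last report we finished with, if there is a checkpoint.
    my $last_id = 0;
    if ( -e $checkpoint ) {
        open my $fh, '<', $checkpoint or die $!;
        chomp( $last_id = <$fh> );
        close $fh;
    }

    while (1) {
        # fetch_reports() and analyse_chunk() are hypothetical routines.
        my $reports = fetch_reports( after => $last_id, limit => $chunk_size );
        last unless @$reports;

        analyse_chunk($reports);
        $last_id = $reports->[-1]{id};

        # Save progress so a crash doesn't mean starting again from scratch.
        open my $fh, '>', $checkpoint or die $!;
        print {$fh} "$last_id\n";
        close $fh;
    }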

The reports site still has 27,000 entries to process, so I'm not expecting that to be down to a more reasonable number until the weekend. Turning the builder back on has highlighted some of the tweaks that are now missing from the work I did in the few weeks prior to the disk crash. In time I hope to address these performance tweaks again, but for now I'm just letting it build in peace :)

I'm now preparing to turn the feeds back on, to catch up on all the reports submitted in the last few weeks. I have run a couple of requests to fill in some of the gaps, and all seems to be fine, so I hope that a few days of dedicated feed processing will get through the bulk of reports from the last 3 weeks.

I haven't switched Apache back on yet, as one thing I've noticed is that the server load is getting unnecessarily high. I know there is an intense amount of disk IO happening at the moment, but there appears to be far more disk access than there was previously. As a consequence I've started to reduce both the number of logs and the amount of log output. In most cases the output was only for information or debugging, so isn't necessary except in unusual circumstances. Hopefully after the weekend I'll get to switch Apache back on and can start some performance tuning.

If anyone has knowledge of performance tuning system processes on Debian Squeeze, please get in touch. Trying to work out which tasks are essential and which can be run less frequently is proving tricky.

File Under: server
Posted by Barbie
on 14th September 2011

Initial checks on the database highlighted some discrepancies, which have now been fixed. The databases have been archived and are now rebuilding. It is hoped that this will be completed within the next few days.

Once the databases are all rebuilt and synced, the websites will slowly be switched back on. The first sites to appear will be the Statistics and Devel sites, with the Reports website coming back online once the bulk of the support files (JSON, JS & HTML) have been recreated.

The CPAN Testers server is also one of the Tier-1 fast mirrors for CPAN. As this is quite important for a number of services, it was the first part of the server to be rebuilt. Finding a suitable BACKPAN seed has proved troublesome, as apart from the FUNET server there are no public rsync mirrors. While the FUNET server has previously been fine for seeding, David Cantrell highlighted that some of the timestamps in the repository are incorrect. It appears someone has touched some of the files on the FUNET server without realising the consequences. As such, the current BACKPAN repo may not correctly list the upload dates.

Despite several traumas, particularly with permissions, the repos for BACKPAN and CPAN are now available for FTP and rsync access. Note that the FTP, rsync and HTTP paths/modules (more on HTTP in a moment) all use the capitalised versions of BACKPAN and CPAN. This mostly affects rsync, as the previous module names were lower case.

All websites, including the HTTP access to BACKPAN and CPAN, are currently unavailable. With the databases rebuilding, disk IO is at full throttle. Unfortunately Apache also tries to access many files, particularly for logging, and the server load is impacted considerably. As such, to allow the databases to rebuild as quickly as possible, the webserver will remain turned off.

After turning on FTP access I was quite intrigued to see Google attempting a denial of service on the server. Having approximately 50 bots all trying to scan the FTP directories was not good. I have now blocked access for Googlebot, and will do so for any other bot that I see using the FTP or rsync repos. I don't have a problem with a single connection scanning the directories, or with anyone requiring a full archive download, but any range of IPs trying to access the server all at once will be blocked.

We're getting there, but it's just taking a little longer than I'd hoped to get ourselves back online. If this has taught me anything, it's that while database and source code backups are all well and good, a complete and regular backup of web directories and config files is also extremely useful!

More news soon.

File Under: server

All was going well last month: we had a few problems balancing the feed and database inserts, but with some help from Devel::NYTProf we had improved performance and were getting back on top of things. We also had two presentations at YAPC::Europe, by myself and Léon, and were heading for 1 million test reports submitted in a single month ... when ...

The CPAN Testers server hard drive developed a fault. This wouldn't have been a problem had the mirrored drive, which had failed earlier in the year, been correctly replaced. Alas, the first failed drive was still absent, with the second drive failing and switching to read-only mode almost immediately on reboot. As such I had just started a backup in case we lost the drive completely, and then it did fail completely :( Our hosting company required us to accept complete data loss before replacing the two drives.

Before I continue, please note that this hasn't affected the Metabase server, and all reports currently being submitted are being safely stored. Once the CPAN Testers server is ready, the feed will be switched back on and the server will start updating the websites again.

Starting from scratch has meant reinstalling all the packages and modules needed, as well as reloading the database with over 10 years' worth of data. The cpanstats tables I took from a backup made a week before the drive failed, as it appears some faults had affected the more recent backups; these have now all been reloaded. The metabase tables, some 7-8 million records (plus associated indices), are currently being uploaded. The articles tables unfortunately hit a problem: the original database backup has a fault in the tar file and, with it being so huge, tar couldn't cope with trying to fix it. Thankfully, Robert and Ask gave David and me an archive of all the original NNTP posts back last year, and these are being reparsed and inserted into the database. Once all these metabase and NNTP article records have been uploaded, I'll then run some database checks to ensure we have everything synced to the last known point.

The first websites and processes to be put back into action have been the archive accesses to the CPAN and BACKPAN directories. You can once again get full access to the CPAN Testers' Tier-1 (fast mirror) services via the HTTP, FTP and rsync protocols.

It seems there were 2 sets of files that I hadn't backed up for quite some time, which I really should have done. The first is the Apache configuration files, which have been tweaked every so often, so while I have some backups for the entries, they are mostly out of date. As such, please be aware that some sites may come back quicker than others. The second set is the cronjob files, which again have been tweaked since the backups I have. While it's nothing I can't fix, it just may take a little longer to get the server back on form.

Interestingly, Léon had asked me in Riga about the backups, and whether we needed somewhere else to store them. Thankfully the offsite backups are in London and Birmingham, but these only cover the database and source code files; the resulting website files will need to be recreated. As such, I'll now be looking to make a periodic snapshot of these, so we can rebuild from a known point rather than from scratch.
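
Something along these lines, run from cron, would probably do. The paths below are assumptions, only there to illustrate the idea of a dated snapshot of the web directories and config files.

    use strict;
    use warnings;
    use POSIX qw(strftime);

    # Assumed locations - the real web directories and config files differ.
    my @targets = (
        '/etc/apache2',
        '/etc/cron.d',
        '/var/www/cpantesters',
    );

    my $stamp   = strftime( '%Y%m%d', localtime );
    my $archive = "/backups/site-snapshot-$stamp.tar.gz";

    # One dated tarball per run, so a rebuild can start from a known point.
    system( 'tar', '-czf', $archive, @targets ) == 0
        or die "snapshot failed: exit code " . ( $? >> 8 ) . "\n";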

Our biggest problem is time. With approximately 250GB of database data to insert and index, and then having to rebuild the website data files, I'm not too sure how long it will take to get us back on our feet again. I hope that some of the sites will be back online by the end of the week, but the main Reports site may not be fully operational until next week. Please bear with us, and we'll get things back on track as soon as possible.

One last thing I had meant to include in last month's summary: I was interested to see the graphs that Carey Tilden had created using the CPAN Testers Statistics data. I may well include similar graphs on the stats site in the future. So thanks, Carey, for the idea.

Hopefully, I'll have a more uplifting summary next month, and will be able to confirm our report submission rate. I'll also have some news of other things happening within the CPAN Testers world.
