News & Views

Posted by Barbie
on 11th November 2011

Recently Andreas alerted me to a problem with the SQLite database used to store the basic metadata for the CPAN Testers statistical database, aka cpanstats. On reading the database, some queries now return the error message "database disk image is malformed". It's unclear where the error has occurred, but it seems to be something that has only surfaced recently.
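For anyone wanting to check whether their own downloaded copy is affected, a minimal sketch in Python follows. It relies only on SQLite's built-in `PRAGMA integrity_check`; the function name `check_integrity` and the file path you pass in are illustrative assumptions, not part of any CPAN Testers tooling.

```python
import sqlite3


def check_integrity(db_path):
    """Return a list of problems SQLite reports; an empty list means healthy."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError as exc:
        # Severe corruption can make the pragma itself fail, e.g. with
        # "database disk image is malformed".
        return [str(exc)]
    finally:
        conn.close()
    messages = [row[0] for row in rows]
    # A healthy database yields exactly one row containing "ok".
    return [] if messages == ["ok"] else messages
```

An empty result suggests the file is structurally sound; anything else lists what SQLite found, which may help narrow down where the damage is.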

As a consequence I am now rebuilding the complete SQLite database. This means that the downloads available from the Development website will remain static until this is complete. Once complete and all is fine, then the backup mechanisms will be re-enabled.

However, there is the possibility that the database has grown so large now (with over 17 million records), that the data storage, and particularly the indexing, is not being written to disk correctly. With the database currently being around 5GB uncompressed, and just under 1GB compressed, it would be beneficial to reduce its size for efficiency, disk IO and bandwidth. So I have started to think of the alternative options.

Firstly the easiest option would be to create a database that only includes reports for distribution releases on CPAN. This would reduce the size slightly, but as we are now submitting reports at a far greater rate than for those releases that reside only on BACKPAN, the short-term gain would be quickly lost within a month or two.

Secondly we could only store data for a set period, such as per year. This would allow those that only need updates to just retrieve the latest instance (current + previous year), while those that need to start from scratch would need to download all the annual snapshots first. This would mean that all the data is easily accessible, but it does require anyone who wishes to maintain their own database to have a mechanism to update their local copy, rather than just download and have the database instantly ready.
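To give a feel for the kind of update mechanism this would need, here is a rough sketch in Python that folds a series of snapshot files into a local copy. It assumes, purely for illustration, that every snapshot and the local database share one identically defined table (called `cpanstats` here) with a primary key; the function name `merge_snapshots` is hypothetical.

```python
import sqlite3


def merge_snapshots(local_path, snapshot_paths, table="cpanstats"):
    """Copy every row of each snapshot into the local database.

    Assumes (hypothetically) that `table` exists with the same columns
    in the local database and in every snapshot, and has a primary key
    so that re-applying a snapshot replaces rows instead of duplicating.
    """
    conn = sqlite3.connect(local_path)
    try:
        for path in snapshot_paths:
            conn.execute("ATTACH DATABASE ? AS snap", (path,))
            try:
                with conn:  # commit on success, roll back on error
                    conn.execute(
                        f"INSERT OR REPLACE INTO main.{table} "
                        f"SELECT * FROM snap.{table}"
                    )
            finally:
                conn.execute("DETACH DATABASE snap")
    finally:
        conn.close()
```

Someone starting from scratch would pass every annual snapshot in order, while someone keeping up to date would pass only the latest one.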

A third option is to provide an API that returns a list of the updates in CSV format, with the ability to specify a from date and a to date, with the full SQLite database being created on a monthly basis. This would reduce disk IO and bandwidth, but would still require a local copy to be maintained with regular updates.
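As a sketch of what maintaining a local copy from such a feed might look like, the Python below applies one batch of CSV updates to a local SQLite file. No such API exists yet, so everything here is an assumption: the column names (`id`, `state`, `dist`, `version`), the table layout, and the function name `apply_updates` are all hypothetical, and fetching the CSV for a given date range is left out entirely.

```python
import csv
import io
import sqlite3

# Hypothetical column layout for the CSV feed and the local table.
CREATE_SQL = (
    "CREATE TABLE IF NOT EXISTS cpanstats ("
    "id INTEGER PRIMARY KEY, state TEXT, dist TEXT, version TEXT)"
)
INSERT_SQL = (
    "INSERT OR REPLACE INTO cpanstats (id, state, dist, version) "
    "VALUES (?, ?, ?, ?)"
)


def apply_updates(conn, csv_text):
    """Apply one batch of CSV updates; return the number of rows applied."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        (int(rec["id"]), rec["state"], rec["dist"], rec["version"])
        for rec in reader
    ]
    with conn:  # one transaction per batch
        conn.execute(CREATE_SQL)
        conn.executemany(INSERT_SQL, rows)
    return len(rows)
```

Because rows are keyed on `id` and written with `INSERT OR REPLACE`, applying overlapping date ranges twice should leave the local copy unchanged, which makes it safe to re-request a range after a failed download.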

A fourth option would be to switch to a MongoDB (or similar) database and enable replication. This would reduce bandwidth usage and reduce disk IO, however it would mean that anyone wanting to maintain a local copy would have to put in a lot more work to set-up and configure their local copy, and would require them to change any tools they currently have to use the new style of database storage. In the longer term, this is possibly an idea for a Google Summer of Code project.

Of all of these the journalling style of the third option would probably be the most practical in the short- to medium-term, while I think moving to a database that supports efficient replication would be better long-term. All would require work and/or tools for anyone maintaining their own copy, but if we can prepare all the necessary code before switching over, the change shouldn't be too painful.

These are all only ideas at the moment, and no plans have been made to change anything. If you have other suggestions that might be viable, I'd be pleased to hear from you. If you currently download the SQLite DB, I'd also like to know whether any of these ideas would work for you.

As mentioned, no changes are planned and, beyond the current rebuild, the current DB is not going away any time soon. However, it does make sense for us to look at more efficient ways of exposing the data. Scaling for the future should not mean continually copying the same 5-10GB of data around our eco-system every day.

October was a rather quiet month publicly, but behind the scenes there have been a number of fixes, improvements and discussions.

The most notable fix was getting the summary emails running again. It is still not perfect just yet, but I'm hoping that the ongoing fine-tuning will reduce the bounce-backs and faults that I'm currently seeing. The configurations are all from what was in the database at the end of August, so apologies if you've wanted to change these. The configurations are normally exposed via the Preferences website, but as the SSL certificate for the new server hadn't been approved, it has taken a little longer than anticipated to get it up and running again. GoDaddy have now approved this and it's all ready to go, so I'm hoping this can all be sorted in the next few days.

There are still some problems building the web pages, with some reports getting missed. However, I now have tools in place that double-check what is being processed, and re-inject into the build queue any reports that may have been missed. The fact that the builder and the feed parser are now capable of processing so much so quickly means that, to a large degree, processing is now near real-time. Typically the majority of reports are now available on the Reports website within a few hours of being submitted.

Some of the discussions behind the scenes have been about sponsorship. We currently have two large international corporations who are interested in supporting CPAN Testers. Hopefully we can progress this over the next few months, and early next year we can give you more details. The discussions of sponsorship have also led to discussions of potentially setting up a donation fund for CPAN Testers. Over the years we have had several people approach us asking if they can contribute to CPAN Testers. As we aren't a legal entity, it hasn't been something we have actively pursued previously. However, we are now in talks to set something up for the future, and again hopefully in the new year we'll be able to tell you more details.

I would like to thank everyone who has been very supportive of CPAN Testers over the last few months. It has been a frustrating few months, but so many have contacted me personally or posted on lists and IRC to add their thanks and support for what we do. It really is very much appreciated.

To finish with, I have a few stats for you. As mentioned previously, in August we managed to submit over 1 million reports in a single month. In the two months following we haven't quite matched that submission rate, but it's not been too far behind. So much so that we now have nearly 17 million reports in the database. We currently have reports covering 27 different Operating Systems, although around 18 of them are being tested on a regular basis. The page builder is now capable of processing several hundred thousand reports a day, and so can easily cope with the growing number of testers and smokers we'd like to see contributing to CPAN Testers.

There is still a lot happening behind the scenes, but looking forward to 2012 we are in a much better position to grow CPAN Testers for the long-term future. We have an enviable infrastructure within the programming language and QA worlds, and we'd like to keep it that way :)
