

News & Views

Posted by Barbie
on 25th May 2010

For some time now I have been promising to publish the list of CPAN Authors who, for one reason or another, appear to be uncontactable. These are authors whose PAUSEID @ address generates a rejection of some form when mail is sent to it. Typically this appears to be for addresses which were set up as redirections to destinations that have since become obsolete (perhaps after a change of job or personal domain), but there are a few that are just not set up to receive emails.

I have now created a Missing In Action page, which I will keep updated as I receive bounce backs from the CPAN Testers daily summaries. In some cases bounce backs require verification from a real person, and I do eventually get round to those, but the ones listed are specifically for emails that are outright rejected. If you have not been receiving any CPAN Testers summaries or reports and have been expecting some, please first check this list to ensure that you haven't had your preferences disabled.

If you are on the list, please read the instructions on the page to learn how to enable your PAUSE email and get your CPAN Testers preferences re-enabled.

File Under: pause / prefs
Posted by Barbie
on 25th May 2010

Recently Leo Lapworth updated several of the sites he's been working on to list the current stats for CPAN, as was previously seen in the footer of each page. It's good to remind people (even subconsciously) just how big CPAN is. In his post Leo also wondered what the 20,000th distribution was. With the CPAN Testers database holding lots of stats about CPAN, as well as the CPAN Testers reports, it was fairly straightforward to extract some numbers. In fact it proved so straightforward I promised to include it in the CPAN Testers Statistics website.

I'm pleased to say I finally found some time to do just that, and have revamped the CPAN Statistics page to include the CPAN Milestones. Now you can keep up to date with what distributions are hitting some significant milestones.

Posted by Barbie
on 17th May 2010

Over the weekend, I put together the evidence for the Microsoft Bing team, to help them understand why their bots are hitting the site so aggressively. In addition I put several questions to them as to how their bot clusters work, and why they don't follow some very simple rules.

To give you an understanding of the problem, the following are some basic stats gleaned from the logs, for those requests identifying themselves as 'msnbot'.

First: 09/May/2010:02:26:22 +0200
Last:  15/May/2010:01:21:42 +0200
Total Hits:      82130
Total Requests:  79999
Total IPs:       161
Status Code 200: 8473
Status Code 301: 83
Status Code 403: 73572
Status Code 500: 2
Max host per second:  74 [09/May/2010:10:42:06 +0200]
Max hosts per second: 15 [12/May/2010:16:19:42 +0200]
Max hits per second:  77 [14/May/2010:15:03:15 +0200]

The First/Last timestamps are the first and last entries in the logs that were processed, a period of roughly 6 days. Total hits is the number of entries processed, and total requests is the number of requests for anything other than 'robots.txt'. These are followed by a breakdown of the total hits by status code returned.

However, the next three figures are the most questionable. Max host per second is the greatest number of hits in a single second by a single IP address. Max hosts per second is the greatest number of unique IPs submitting a request in a single second. Lastly, Max hits per second is the greatest number of requests seen in any single second.

So that is 15 unique IP addresses hitting the site in a single second, with a maximum rate of 77 requests a second. Does that strike you as reasonable?
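
As a rough illustration of how the three per-second figures described above could be derived, here is a minimal Python sketch. It is not the script used to produce the stats in this post. It assumes the logs are in Apache common or combined format, with the client IP first and the timestamp in square brackets; the function name per_second_maxima and the regular expression are made up for the example.

```python
import re
from collections import Counter

# Assumption: Apache common/combined log lines, for example
# 192.0.2.1 - - [09/May/2010:10:42:06 +0200] "GET / HTTP/1.1" 403 0
# Captures the client IP and the bracketed timestamp (one-second resolution).
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')


def per_second_maxima(lines):
    """Return (max host per second, max hosts per second, max hits per second)."""
    hits_by_ip_second = Counter()  # (ip, second) -> hits from that IP in that second
    hits_by_second = Counter()     # second -> hits from all IPs
    ips_by_second = {}             # second -> set of distinct IPs seen
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        ip, second = match.groups()
        hits_by_ip_second[(ip, second)] += 1
        hits_by_second[second] += 1
        ips_by_second.setdefault(second, set()).add(ip)
    if not hits_by_second:
        return (0, 0, 0)
    return (
        max(hits_by_ip_second.values()),
        max(len(ips) for ips in ips_by_second.values()),
        max(hits_by_second.values()),
    )
```

Feeding it only the log lines whose user agent identifies as 'msnbot' would give numbers comparable to the three shown above, since each counter is keyed on the one-second timestamp exactly as it appears in the log.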

In my response to Microsoft I have pointed out that when a bot receives a 403 for every request (for the last 4 months), the bot should be intelligent enough to back off the requests and default to the robots.txt. As it turned out, the Apache config was also returning a 403 for robots.txt requests from the blocked IPs (which has since been fixed), though I maintain that if a bot receives a 403 for a request for robots.txt, it should treat that as being disallowed from searching and indexing the site.

However, the attack on Wednesday was from a fresh set of IPs that weren't blocked, and were able to read the robots.txt, which explicitly disallows msnbot. All these new IPs completely disregarded robots.txt and searched and indexed the site anyway. Had I not been on the machine at the time, they would have knocked the machine offline.
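
To show what honouring such a rule looks like from the crawler's side, here is a small Python sketch using the standard library's urllib.robotparser. The robots.txt content below is hypothetical (the site's real file is not reproduced in this post), and the helper name may_fetch and the example.com URL are invented for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the real file is not shown in the post. It bans
# msnbot from the whole site and leaves every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /

User-agent: *
Disallow:
"""


def may_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to request url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("msnbot", "otherbot"):
        print(agent, may_fetch(agent, "http://example.com/some/page"))
```

A well-behaved bot would run a check like this before every request and skip any URL for which it returns False. Python's parser also takes the same view as argued above: when its read() method fetches robots.txt and gets a 401 or 403 back, it treats the whole site as disallowed.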

Another point I made was that CPAN Testers is a non-commercial, non-High Availability site, funded by the Perl community and administered by volunteers. Sending 77 requests a second is exactly the same as a denial of service attack for us. We don't have the funds and resources to manage a fleet of web and database servers with a load balancer calmly managing requests for us. Treating us (or any other non-commercial site) like a major news site is just plain irresponsible.

One link that was sent to me highlighted that this aggressive behaviour has, in at least one case, cost a site owner money in excessive bandwidth charges. If Microsoft do not put some reasonable controls on their bots, then it is only a matter of time before someone sues or contacts the FBI for computer misuse.

I have provided the logs to Microsoft, so we'll see what their response is. As of Saturday, they have obviously changed something, as all hits to the domain switched to just robots.txt requests, seeming to honour the request not to index the site. However, I was still seeing 4 requests a second from a single IP, so it seems they still have some work to do.

File Under: server
Posted by Barbie
on 13th May 2010

Back in January, I reported how Microsoft had launched what amounted to a denial of service attack on the CPAN Testers server. It seems that 4 months later, we have yet again been targeted for attack by Microsoft. After the last attack, any IP address matching '65.55.*.*' hitting the main CPAN Testers website was blocked (returning a 403 code). Every few weeks I check to see whether Microsoft have actually learnt and calmed down their attack on the server. So far, disappointingly, despite an alleged Microsoft developer saying they would look into it, the attack on the server has continued with little alteration in frequency or numbers. Had they changed and been considerably less aggressive, I would have lifted the ban.

Yesterday, Microsoft launched a further attack on the server using a completely new set of IP addresses. Now, just to clarify, this wasn't just a completely new set of IP addresses, but a completely new set PLUS the original set, thus effectively doubling the attack on the server. Now you could claim stupidity or ignorance on behalf of the msnbot/Bing developers, but after being warned last time, and receiving 403s from their existing bots, by adding in a whole new set of IPs, I consider this latest attack nothing short of malicious.

These new IP addresses have now been added to the blocklist, and I'm now writing a script to alert me should any new IP address from Microsoft be added to their attack formation. Thankfully, I happened to be on the server at the time both attacks hit, and managed to catch the IPs before they took out the server completely.
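
For readers wondering what such an alert script might look like, here is a minimal Python sketch. It is not the script mentioned above. It assumes Apache combined log format with the user agent as the last quoted field, and it expresses the '65.55.*.*' block from this post as the CIDR range 65.55.0.0/16. The newer blocked addresses are not listed in the post, so none are included. The function name unblocked_msnbot_ips is invented for the example.

```python
import ipaddress
import re

# Assumption: '65.55.*.*' written as a CIDR block. The newer blocked IPs are
# not given in the post, so the list holds only this one network.
BLOCKED_NETWORKS = [ipaddress.ip_network("65.55.0.0/16")]

# Assumption: Apache combined log format, where the client IP is the first
# field and the user agent is the last double-quoted field on the line.
LOG_LINE = re.compile(r'^(\S+) .*"([^"]*)"$')


def unblocked_msnbot_ips(lines, blocked=BLOCKED_NETWORKS):
    """Return IPs that identify as msnbot but are not in any blocked network."""
    found = set()
    for line in lines:
        match = LOG_LINE.match(line.rstrip())
        if not match:
            continue  # not in the assumed format
        ip_text, user_agent = match.groups()
        if "msnbot" not in user_agent.lower():
            continue
        try:
            ip = ipaddress.ip_address(ip_text)
        except ValueError:
            continue  # first field was a hostname or malformed
        if not any(ip in network for network in blocked):
            found.add(ip_text)
    return found
```

Run periodically over the most recent log entries, a non-empty result would be the trigger for an alert, and the returned addresses the candidates for the blocklist.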

With my last post about this, I was accused of doing a disservice to Perl. Had I not been furious at the time, and written about the incident, I wouldn't have learnt that this was a Microsoft tactic that had infuriated a lot of people, or discovered that I wasn't the only sysadmin or website administrator around the world who had chosen to block Microsoft from their websites and servers. If Microsoft think thuggery is the way to improve their search content, then they are very sadly mistaken.

Update: Microsoft have now been in touch, and again apologised. We'll have to wait and see whether this can be resolved.

File Under: server
Posted by Barbie
on 12th May 2010

David Golden will be speaking at OSCON this year, talking at 10am on Friday morning, 23rd July 2010. His talk is entitled Free QA! What FOSS can Learn from CPAN Testers, and looks at how CPAN Testers provides a good example of how successful Open Source QA can be. Drawing on his experience from being a CPAN Tester, toolchain developer and leading the design of CPAN Testers 2.0 and the Metabase, David shows how other projects can benefit.

It looks to be a very engaging talk, and I know David would very much appreciate your attendance at his talk if you are going to OSCON.
