What do Scunthorpe, Clitheroe and Penistone have in common?

Computers make very fast, very accurate mistakes

Building the world’s best website or nailing your SEO is all for naught if end users can’t access it. The net-nanny filtering used by schools, libraries and indeed ISPs (in their “Family Filter” options) has thankfully moved on from the naive keyword blacklists which blocked sites about Essex or Sussex, a failure known colloquially as “the Scunthorpe Problem”. But even with modern context-aware semantic analysis of site content, “overblocking” still raises challenges. What is the difference between a site discussing sexual health and a site with pornographic content? This is perfectly obvious to a human, but not to a computer. Computers are masters of malicious compliance. They do exactly what they are programmed to do - no more, no less. They can make very fast, very accurate mistakes.
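To make the Scunthorpe Problem concrete, here is a minimal Python sketch of a naive substring blacklist next to a whole-word match. The blocklist and both function names are invented for illustration; no real filtering product is claimed to work this way.

```python
import re

# Hypothetical one-word blocklist, for illustration only.
BLOCKLIST = ["sex"]


def naive_blocked(text):
    """Substring match: flags any text that contains a blocked string."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def whole_word_blocked(text):
    """Whole-word match: flags a blocked term only when it stands alone."""
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", lowered)
        for term in BLOCKLIST
    )


titles = [
    "Essex County Rifle Association",
    "Sussex Target Sports Club",
    "Sex education resources",
]
for title in titles:
    print(title, naive_blocked(title), whole_word_blocked(title))
```

The substring check flags all three titles. The whole-word check lets Essex and Sussex through but still flags the sex education title, which is the harder problem: telling a health resource from pornography takes context, not better string matching.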

Even the major players at the forefront of machine learning can struggle - in January 2021 Facebook was forced to apologise after removing posts and content featuring “Plymouth Hoe”. How many innocent gardening-based posts had been bumped off in the past, but in small enough numbers to avoid forming an obvious pattern?

Target Sports

A long-standing debate in target shooting fraternities is whether the firearms we use are “weapons”. We’ll leave that to one side here, but it has important implications online, where “weapons” is a category used by web-filtering software and generally gets a site tagged as PG-13 or higher. “Weapons and Violence” is often a check-box that corporate and education network administrators can explicitly filter (along with forums, social media, shopping, pornography, gambling, alcohol, dating, etc). Whether or not you agree that all firearms are inherently weapons, we can probably find common ground in the notion that if, for instance, the NRA and NSRA websites are to be categorised as “weapons”, they should not be categorised as that alone. They should also be tagged with “Sports” or “Non-Profit”, and are certainly not R-Rated.

The duality of the shooting sports poses significant difficulties for algorithms. It would be laughable to ban the IOC website (olympics.com) - which contains discussion of the “weapons” involved in shooting, archery and fencing. It is also reasonable that a primary school wouldn’t wish children to stumble over a hunting forum containing “bag” photos of dead quarry.

This is incredibly frustrating for target clubs and associations who find themselves blocked by ISP Parental Controls - which (since 2011) are often enabled by default, with the customer required to opt out.

Cockup rather than conspiracy

It is tempting to complain this is the work of “the antis” (whoever they are) trying to squeeze out shooting sports, but the rot runs deep. Restaurants and country pubs can be caught up in alcohol categories. Sites addressing bullying or sexual health are mislabelled as violent or pornographic. The Sexual Advice Association is a UK registered charity providing sex education resources. Good luck accessing their site from education networks though.

It is inevitable that in trying to categorise the internet, mistakes will be made. As it stands, the only way around this is to appeal to those collating the data for manual human review. In some cases it is also reasonable to reach out to their customers (e.g. complaining that BT are blocking a site wholesale).

The good news is that many operators know their categorisations are weak. By way of example, the Categorify database (used by CleanBrowsing) allows you to check sites against their list. Entering nsra.co.uk reports that the site is listed as “Weapons” and PG-13. It also notes there is no violence, abusive language, nudity or sexual content.

Categorify also provide an API in addition to their website, and this is where it gets interesting:

{"domain":"nsra.co.uk",
 "ip":"85.233.160.139",
 "country-code":"uk",
 "country":"UK",
 "rating":{"language":false,"violence":false,"nudity":false,"adult":false,
           "value":"PG-13","description":"Not recommended for kids under 13."},
 "category":["Weapons"],"confidence_level":"low",
 "keyword_heatmap":{"nsra":93,"rifle":69,"shooting":65,"smallbore":62,"association":60,"club":45,"national":36,"target":29,"shop":27,"surrey":25,"news":25,"lincolnshire":24,"annual":24,"championships":24,"pistol":22,"competitions":22,"county":22,"airgun":21,"sports":20,"overview":20}}

The API feed not only exposes the category, but also the confidence level (in this case “low”). It also gives us the most-frequent keywords that they’ve scraped off the website - after “nsra”, the terms “rifle” and “shooting” are massively ahead of words like “association” or “club”. When you see how the analysis is conducted, it becomes clear where the problems emerge (what’s less obvious is how they could improve).
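If you want to pull those fields out programmatically, a short Python sketch along these lines should do it. It parses a trimmed copy of the nsra.co.uk response quoted in this post rather than calling the API, because the endpoint and any authentication are not covered here. The `summarise` function name is my own invention.

```python
import json

# Trimmed copy of the nsra.co.uk response quoted in this post
# (the ip and country fields and most of the keywords are left out).
RESPONSE = """
{"domain": "nsra.co.uk",
 "rating": {"language": false, "violence": false, "nudity": false,
            "adult": false, "value": "PG-13",
            "description": "Not recommended for kids under 13."},
 "category": ["Weapons"],
 "confidence_level": "low",
 "keyword_heatmap": {"nsra": 93, "rifle": 69, "shooting": 65,
                     "smallbore": 62, "association": 60, "club": 45}}
"""


def summarise(raw, top_n=5):
    """Extract the category, rating, confidence and top keywords."""
    data = json.loads(raw)
    # Sort keywords by score, highest first.
    ranked = sorted(
        data["keyword_heatmap"].items(),
        key=lambda item: item[1],
        reverse=True,
    )
    return {
        "domain": data["domain"],
        "categories": data["category"],
        "rating": data["rating"]["value"],
        "confidence": data["confidence_level"],
        "top_keywords": ranked[:top_n],
    }


summary = summarise(RESPONSE)
for field, value in summary.items():
    print(field, value)
```

Having the confidence level and keyword ranking to hand is useful evidence to quote when you ask for a re-categorisation.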

My green ink is ready, who do I write to!?

There are three approaches here where a site is being blocked:

  1. Go to source and request re-categorisations
  2. Request that ISPs unblock your site if they’re over-blocking
  3. Make your site less likely to be mis-categorised in the first place

Request Database Recat

There are a number of major providers who you can check against:

All the services listed offer a form to request re-categorisations. Unfortunately only Categorify has been receptive to actually changing categorisations. The others appear a bit reluctant, and you just get a templated response saying it has been changed to… whatever it was before. In the case of Fortinet I have had emails come back so quickly I doubt any human review is actually taking place.

Request Network Unblock

Some time ago I wrote about the Open Rights Group’s Blocked project. Go and check if ISPs are blocking your site. They’re probably using lists from a provider above, so apply pressure from both directions by working through the ISPs.

Make your site more readily understandable to bots

It’s fair to say that this shouldn’t be our job. It is the responsibility of businesses selling category databases to get it right. Nonetheless, just as you seek to make your club’s website visually appealing, easy to use and well-ranked in search engines, so we should give thought to categorisation.

Take a look at the keyword heatmap for your site on categorify (click “Developer mode” at the bottom of the page and look for the string of words after “keyword_heatmap”). If you see a lot of “weapon friendly” terms up front then check your copy and your site keywords. Consider replacing words like “shooting” with phrases like “target sports”. Indeed make sure the word “sport” is scattered liberally throughout. Is this dumbing-down? Perhaps even “giving in”? Maybe. But remember that computers are extremely stupid. They will comply with their programming exactly. We’re just taking account of that process. If the word “sport” doesn’t appear in the keywords, then that’s a legitimate issue with the site - you’re not accurately representing the club.
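As a rough way of auditing a heatmap, the sketch below totals the scores that fall on “weapon friendly” words and on sport-related words. The two term lists and the `audit` function are my own guesses for illustration, and how Categorify actually weighs keywords is not known here. The heatmap is the nsra.co.uk one quoted earlier in this post.

```python
# Hypothetical term lists: adjust them to suit your own site and discipline.
WEAPON_TERMS = {"rifle", "shooting", "smallbore", "pistol", "airgun"}
SPORT_TERMS = {"sport", "sports", "target", "club", "championships", "competitions"}

# Keyword heatmap for nsra.co.uk as quoted in this post.
HEATMAP = {
    "nsra": 93, "rifle": 69, "shooting": 65, "smallbore": 62,
    "association": 60, "club": 45, "national": 36, "target": 29,
    "shop": 27, "surrey": 25, "news": 25, "lincolnshire": 24,
    "annual": 24, "championships": 24, "pistol": 22,
    "competitions": 22, "county": 22, "airgun": 21, "sports": 20,
    "overview": 20,
}


def audit(heatmap):
    """Total the heatmap scores landing on each hypothetical term list."""
    total = sum(heatmap.values())
    weapon = sum(s for word, s in heatmap.items() if word in WEAPON_TERMS)
    sport = sum(s for word, s in heatmap.items() if word in SPORT_TERMS)
    return {
        "total": total,
        "weapon_score": weapon,
        "sport_score": sport,
        "weapon_share": weapon / total,
        "sport_share": sport / total,
    }


result = audit(HEATMAP)
print(result)
```

With these lists the weapon-related terms carry noticeably more weight than the sport-related ones, which is the imbalance the copy changes above aim to correct. Re-running the audit after editing your copy gives a crude before-and-after comparison.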

All doom and gloom?

Well, no. If this were a massive issue we’d have noticed more pervasive over-blocking before now. Generally a “Weapons” categorisation won’t get a site banned unless an administrator has explicitly filtered on the “Weapons” category. In my research, sites were accessible through CleanBrowsing’s public Family Filter even with a “Weapons” category and a “PG-13” rating. That filter just kills R-rated material, anything tagged as pornography or nudity, and proxy and VPN endpoints. It is generally institutional users such as schools and libraries where the additional filters might get toggled “on”.

That said, it’s worth checking. Categorify had essexsra.co.uk categorised as “Weapons, Adult/Pornography” and rated “R & NSFW”. It will have been inaccessible from many networks - and not because of the shooting content! I don’t know how it acquired the porn tag, but I was able to submit a recat request and it’s no bad thing that it’s fixed.
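The filtering behaviour described in this section can be sketched as a simple decision function. This is my own reading of the observed behaviour, not CleanBrowsing’s actual logic. The category labels other than “Weapons” and “Adult/Pornography” are assumed, as is the `is_blocked` function.

```python
# Hypothetical labels for what a default family filter always blocks.
ALWAYS_BLOCKED = {"Adult/Pornography", "Nudity", "Proxy", "VPN"}


def is_blocked(rating, categories, extra_blocked=frozenset()):
    """Return True if a site with this rating and categories is blocked.

    extra_blocked holds the optional categories an administrator has
    ticked, e.g. {"Weapons"} on a school network.
    """
    # R-rated material is always blocked ("R & NSFW" starts with "R").
    if rating.split()[0] == "R":
        return True
    blocked = ALWAYS_BLOCKED | set(extra_blocked)
    return any(category in blocked for category in categories)


# Default family filter: a PG-13 "Weapons" site gets through.
print(is_blocked("PG-13", ["Weapons"]))
# School network with the "Weapons" box ticked: blocked.
print(is_blocked("PG-13", ["Weapons"], {"Weapons"}))
# The original essexsra.co.uk listing: blocked everywhere.
print(is_blocked("R & NSFW", ["Weapons", "Adult/Pornography"]))
```

The point of the sketch is that a stray tag or rating matters far more than the “Weapons” label itself: the first case passes on a default filter, while the third fails no matter what the administrator has chosen.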

My next project is leveraging the ~550 domains in clubfinder to investigate the prevalence of overblocking and establish if it can be tackled at scale. Write up in due course.