What's up with British Shooting's website?

In my last post, I recounted the lessons I’d learnt building a reddit bot to repost target shooting news to relevant subreddits. I learnt the hard way that RSS is amazing but subject to change, and that you need robust controls to stop the bot spamming your target subreddits if the upstream provider changes their site’s URL schema, does something weird to surface old posts at the top of their RSS feed, or entirely borks their site so your script is staring at previously unseen “dev.example.com” URLs.

I also alluded to the fact that in my list of sites there was one notable exception - British Shooting. British Shooting are the shooting sports equivalent of British Cycling: they handle all UK Sport funding for our Olympic and World Class programmes and are ultimately responsible for developing Olympic and Paralympic talent.

Their website is also a disaster zone.

Looks alright to me!

So, their website looks great. It’s produced by Digital Glacier, who also work with UK Sport and Team GB - which becomes obvious when you compare the colour palettes and general styling. The problem comes if you try to access it with anything other than a mouse or touch-screen. And that should immediately set off red flags for the website of a publicly funded body that serves visually impaired users - users who might want to access information via screen readers or non-traditional user agents.

Where’s the RSS?

My first annoyance was discovering they had no RSS feed. That’s odd - most Content Management Systems spit one out whether you ask for it or not. So I dug in and… okay… this isn’t on an obvious CMS like WordPress, Joomla! or even Drupal. It’s something custom - and the developers haven’t taken the time to build an RSS generator.

No matter, there will be something I can consume. Maybe a sitemap?

No.

Not even a robots.txt.

What? Really?

At this point I was starting to think I was going to have to break out the big guns - write a separate Python module using the BeautifulSoup HTML parser to scrape the News page and grab new stories. So I pointed BS4 at the News page. And… it fetched gibberish. Because there’s no HTML on that site. Not really. None of the buttons or links are HTML anchor tags (<a href=…>). The buttons are all just blobs of JavaScript.
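If you want to see the scale of the problem for yourself, a sanity check along these lines makes it obvious. The News URL is my guess at the path - the takeaway is how few real anchor tags come back:

```python
# Rough sanity check: fetch the News page and count real anchor tags.
# The URL is illustrative; the point is that almost nothing comes back.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.britishshooting.org.uk/news", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", href=True)
print(f"Found {len(links)} links")  # near zero - the "links" are JavaScript blobs
```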

That’s when the penny dropped. Scraping this web application masquerading as a website was going to end about as well as trying to generate an RSS feed of the latest actions taken in the Google Docs interface. It’s a black box, near-impossible to scrutinise - and this is why the only link you ever got for British Shooting in search results was the home page. It’s impossible for web crawlers to index past the home page because there aren’t any links to follow!

The weird thing is, the Team GB website isn’t like that - there’s JS, but it’s still a website. Accessible and crawlable. What did they do - let a work experience kid hyped up on React and Vue loose on this small project, who then wrote an impenetrable black box of JS? Good grief.

5, 6, 7, 8 how can we enumerate?

But wait, the articles are sequentially numbered. I can do something with that, right? The script tries “britishshooting.org.uk/article/2506”: if it gets a 200 response, I post the article and move on to the next ID; if it gets a 404, I know I’ve run past the last article, so I stop, record that number and try again tomorrow.
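In rough outline, the plan looked something like this - the starting ID is just an example, and publish() stands in for the existing “post it to reddit” step in the bot:

```python
# Sketch of the plan: walk the article IDs until a 404 says we've run out.
import requests

article_id = 2506  # example starting point
while True:
    resp = requests.get(f"https://www.britishshooting.org.uk/article/{article_id}")
    if resp.status_code == 404:
        break              # presumably past the newest article - stop here
    publish(resp.text)     # hypothetical helper in the existing bot
    article_id += 1
```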

Oh. All of those URLs return a 404 response code, even when there is an article. There is no real page there - it’s a custom 404 error handler that fetches an article if one exists with the ID in the URL. So there’s no functional way of distinguishing article IDs that actually return an article from article IDs that return an error page.

Well balls.

So I left it - I had other projects to be getting on with.

Is that an API I see before me?

Earlier this year I got curious. Was it really as bad as I thought? Had I just missed something? I’d missed the RSS feed for the Shooting section of BBC Sport on the first pass as well, thinking I could only get the more general BBC Sport feed. So I had a nose around and yes, it was exactly as bad as I remembered. But that 404 error handler had to be fetching data from somewhere, right? And if it’s a big fat client-side framework, it’s probably making a call I can scrutinise. I inspected the Network tab of the browser’s developer tools in the vain hope that I could spot some sort of endpoint or figure out where the content was coming from. And an endpoint is exactly what I found.

“britishshooting.org.uk/api/content/articles/”

API you say? Come to daddy.

After some brief experimentation, it turned out that the article content for “/article/2506” is loaded from “/api/content/articles/2506” as a blob of JSON. I like JSON. Most importantly, the API endpoint returns a 200 status code for extant articles and a 404 when you hit an article ID that doesn’t exist (yet).

This then required only a simple reworking of the basic logic from my existing bot. I keep an article ID in a text file. The bot tries it against the API: if it gets a 404, there’s nothing new, so it stops and tries again tomorrow. If it gets a 200, it publishes that article to reddit, increments the ID by one and tries the next one. When it finally hits a 404, it saves that ID as tomorrow’s starting point.
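In sketch form it looks like this - the file name and the post_to_reddit() helper are stand-ins for bits of the existing bot, not their real names:

```python
# Rough sketch of the daily run. ID_FILE and post_to_reddit() are stand-ins
# for the real bot's state file and reddit-posting code.
import requests

ID_FILE = "last_article_id.txt"
API = "https://www.britishshooting.org.uk/api/content/articles/{}"

def run_once():
    with open(ID_FILE) as f:
        article_id = int(f.read().strip())
    while True:
        resp = requests.get(API.format(article_id), timeout=10)
        if resp.status_code == 404:
            break                       # nothing new yet - try again tomorrow
        post_to_reddit(resp.json())     # hypothetical helper from the existing bot
        article_id += 1
    with open(ID_FILE, "w") as f:
        f.write(str(article_id))        # tomorrow's starting point
```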

Not so hard after all.

Nothing in that was terribly difficult. It required some poking around and getting in the weeds. I now have a custom module which will read articles from the BS website until they update or change it.

But I still can’t get BS news in my personal RSS news reader - not without reworking their JSON into an RSS feed and publishing it myself! Websites should use open standards. They should be accessible for everyone, including the visually impaired. They should use structured data markup (e.g. that laid out by Schema.org) to allow information to be accessed via voice assistants like Siri and Alexa or otherwise synthesised into search results. By publishing standard formats - like RSS - and using standard markup - like Schema.org - we help users access our content and information using standard tools, without needing to build dedicated apps or platforms.
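For the record, that reworking wouldn’t even be much code. A rough sketch - with field names that are pure guesses at the JSON schema they actually use:

```python
# Back-of-envelope JSON-to-RSS conversion. The 'title' and 'summary' fields
# are assumptions - I don't know British Shooting's real schema.
from xml.sax.saxutils import escape

def article_to_item(article_id, article):
    url = f"https://www.britishshooting.org.uk/article/{article_id}"
    return (
        "<item>"
        f"<title>{escape(article.get('title', ''))}</title>"
        f"<link>{url}</link>"
        f"<description>{escape(article.get('summary', ''))}</description>"
        "</item>"
    )

def build_feed(items):
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<rss version="2.0"><channel>'
        "<title>British Shooting (unofficial)</title>"
        "<link>https://www.britishshooting.org.uk/</link>"
        "<description>Rebuilt from the articles API</description>"
        + "".join(items)
        + "</channel></rss>"
    )
```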

Terence Eden wrote about this in 2019 (The future of the web, isn’t the web) following an excellent article from the Government Digital Service (Making GOV.UK more than a website). Websites aren’t just a thing you look at. They are a source of content - for screen readers, search engines, voice assistants and a bunch of other User Agents that I don’t even know exist but are important to somebody.

Let’s keep that in mind as we plan our digital strategies. This isn’t about being stuck in the mud and spurning new stuff. But let’s treat the shiniest new JS library with due caution until it’s proven itself. Let’s not jump on the bandwagon of this week’s new hotness and abandon battle-proven, well-supported technology like structured markup and RSS. Websites have to last years. They need to be maintainable. They need to be accessible.