As summer rolls on and customers turn to rollovers, list spruce-ups, and other maintenance tasks such as metadata refreshes, our alerting and ticketing systems have been exercised a little more than normal.
With this in mind, I’ve spent the last eight weeks working in detail with our engineering team to see what we could do to make this period smoother for customers.
When debugging complicated problems, I always prefer us to act on evidence before instinct, and we invest in several world-class technologies to help us monitor performance and find problems before users do. Thus the first port of call was to take a look at why those systems weren’t always picking up problems before our customers.
The problem here was the level of granularity we measure at: whilst performance was generally excellent over the huge volume of traffic we serve week to week, our setup masked very specific problems. The team changed how key user activities are monitored and increased the monitoring and alerting checks on background jobs in particular, which gave us a much better evidence base to work from.
Finding problems is one thing; being sure fixes do not actually make things worse is another. We’ve been working for some time on a scaling lab: a replica of our live setup, run under “laboratory conditions”, which allows us to test the effects of changes. In practice this is a rack of real servers isolated in a controlled network environment, eliminating external factors and giving us confidence that the only variable affecting results is the change under test.
We use recorded traffic and data from the real system in the lab to make sure we can replay real-world scenarios and make sure major changes will make a positive impact for users.
We’ve also been looking at developer tooling and making sure every developer has a standardised replica of the production environment available to them locally, and standardising on debugging and code inspection tools to make sure we have the right kit to find problems quickly.
Finally, we needed to make some cultural changes to how we work on incident reporting and analysis, so that every issue becomes a learning opportunity rather than a firefighting exercise.
The remainder of this post goes into some detail around specific problems reported by users and how these changes helped us resolve them. If you’re non-technical, feel free to jump straight to the conclusion!
1. Some data slow to propagate
Talis Aspire makes use of a graph data model, and we created the Tripod code library to manage loading and saving data from the database. This library gives the option of regular saves, where data is immediately consistent, or fast saves, where the save is acknowledged straight away, but much of the hard work is done on a queue away from the user. This means data propagates through the system in the background.
This approach is known in software engineering as “eventual consistency”. In normal conditions, fast saves should be consistent in no more than a few seconds, so they are useful when you know the user does not need the data immediately.
An example of where we use fast saves is when using the bookmarklet to save a reference to an external site. The workflow is to invoke the bookmarklet, review the data collected, hit save and then exit back to the external site; with the data saved being used at a later time.
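The difference between the two save paths can be sketched as follows. This is a minimal illustration of the idea, not Tripod’s actual API – the function names, the in-memory store, and the worker thread are all assumptions made for the sake of the example.

```python
import queue
import threading

# Background work queue for fast saves (illustrative, not Tripod's real queue).
background_jobs = queue.Queue()

def regular_save(store, doc_id, data):
    """Regular save: write synchronously, so data is immediately consistent."""
    store[doc_id] = data          # visible to readers as soon as we return
    return "saved"

def fast_save(store, doc_id, data):
    """Fast save: acknowledge straight away; the hard work happens on a queue."""
    background_jobs.put((store, doc_id, data))
    return "acknowledged"         # the data becomes consistent *eventually*

def background_worker():
    """Drains the queue, making fast saves consistent shortly after the ack."""
    while True:
        store, doc_id, data = background_jobs.get()
        store[doc_id] = data
        background_jobs.task_done()

threading.Thread(target=background_worker, daemon=True).start()
```

With this shape, `fast_save` returns before the write lands, which is exactly why it suits the bookmarklet flow: the user has already moved on by the time the data is needed.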
We monitor the length of time it takes for data to become eventually consistent and in recent weeks were receiving reports that occasionally it was taking too long. The changes we made to the granularity of our monitoring helped us see this was a particular problem for bookmarking.
A few customers had also raised tickets reporting that their bookmarks were not always ready to use when they came to add them to a list. We could add more capacity for background work, but by the time new servers were ready, performance would return to normal and the alerts would clear.
On looking into this in more detail we saw that other seasonal background work, such as rollovers and preparation of data for the upcoming live reports feature, were causing user-initiated actions such as bookmarking to be bumped down the queue.
We fixed the problem in two ways. First, we implemented a priority strategy for fast saves, letting us mark some background work as more urgent. User-initiated jobs such as bookmarking can now jump the queue ahead of longer-running jobs such as rollovers.
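A priority strategy like this can be sketched with a simple priority queue; the priority values and job names below are illustrative assumptions, not our real configuration.

```python
import heapq
import itertools

# Lower numbers are served first, so interactive work jumps ahead of bulk work.
BOOKMARK = 0   # a user will want this data soon
ROLLOVER = 10  # long-running seasonal batch work

_counter = itertools.count()  # tie-breaker preserves FIFO order per priority

def enqueue(heap, priority, job):
    heapq.heappush(heap, (priority, next(_counter), job))

def next_job(heap):
    priority, _, job = heapq.heappop(heap)
    return job

heap = []
enqueue(heap, ROLLOVER, "rollover:university-a")
enqueue(heap, BOOKMARK, "bookmark:user-42")
enqueue(heap, ROLLOVER, "rollover:university-b")
```

Here the bookmark job is processed first even though a rollover was queued before it, while jobs at the same priority still run in the order they arrived.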
Secondly, as a result of looking at our processes for introducing more capacity, we made it much easier to increase the overall number of servers available to process background work so all work is processed faster. We can now deploy more capacity in a few minutes, so if we experience spikes in demand in the future, we can quickly add more capacity as and when it’s needed.
2. Power users experiencing slow saves
Having solved the problem above, we received reports that some users were still experiencing general slowdowns in the system when saving any kind of data – whether fast saves were used or not.
Some users make way more use of the system than others – we call these power users. They might be members of library staff responsible for digitisation, or updating lists, or even extremely keen academics. These users have way more data than regular users – tens of thousands of bookmarks, and often hundreds of lists associated with their profiles.
Our analysis showed that these reports were usually coming from these power users.
The number of power users is small in proportion to the number of ordinary users, so again, general monitoring that looks at live system averages is often not useful in picking up these kinds of problems.
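A toy example shows why averages hide this. The latency numbers below are made up for illustration: many fast regular-user saves plus a handful of very slow power-user saves.

```python
import statistics

# Illustrative save latencies in ms: 98 regular-user saves at 50ms,
# 2 power-user saves at 5000ms. These numbers are invented for demonstration.
latencies = [50] * 98 + [5000] * 2

mean = statistics.mean(latencies)               # a healthy-looking average
p99 = sorted(latencies)[int(len(latencies) * 0.99) - 1]

# The mean barely registers the problem, while the 99th percentile
# surfaces exactly what the power users are experiencing.
```

This is why the finer-grained monitoring mattered: splitting out high percentiles (or specific user cohorts) exposes problems that a system-wide average smooths away.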
Our improvements to developer tooling meant we were able to replicate the data of these users in our development environment and recreate the issues easily. This helped us massively in pinning down the improvements required.
The culprit in this scenario was the method we were using to insert updates into the database itself. This relied on replacing documents rather than just adjusting changed properties.
Generally, documents in our database have only a few properties (10 to 20), and replacing the whole document on change is a quick operation: you don’t need to work out up front which properties have changed – you just save the new state. However, power users generally have a lot more data, sometimes thousands of properties depending on the specific data to be saved, and replacing the whole document when just one property had changed was a slow process for these users.
We implemented a new strategy based on in-place updates instead and found that saves for power users could be up to 80 times quicker – a handy improvement!
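The difference between the two strategies can be sketched as below. This is a deliberately simplified model using plain dictionaries – the function names are hypothetical, and the in-place path stands in for the kind of partial update (e.g. a `$set`-style operation in a document database) we moved to.

```python
def replace_document(collection, doc_id, new_doc):
    """Old strategy: rewrite the whole document, even for a one-field change.
    The cost grows with the size of the document, which is painful for
    power users whose documents carry thousands of properties."""
    collection[doc_id] = dict(new_doc)

def update_in_place(collection, doc_id, changed):
    """New strategy: apply only the changed properties.
    The cost grows with the size of the change, not the document."""
    collection[doc_id].update(changed)
```

For a power-user document with thousands of properties, changing one field now touches one property rather than re-serialising the entire document, which is where the large speed-up came from.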
But before release we wanted to ensure that in serving power users better, we hadn’t impacted regular users. There is no substitute for testing with real-world data and usage patterns; what works in theory isn’t always so in practice.
So we turned to the scaling lab where we can simulate real world load that our system encounters in a repeatable, controlled way. We measured the performance of the system in the lab both before and after this change to validate our work. By changing just one variable (introducing in-place updates) we were confident that our fix would work out in the wild.
We released this improvement a week ago and so far the feedback is that power users are enjoying some measurable improvements.
3. HTTPS slow down
We introduced HTTPS support last year and an increasing number of customers have been taking up the option to have their system served fully over HTTPS.
We’d received a few reports of poor response times. Our alerting system hadn’t picked up anything out of the ordinary, yet when accessing customer tenancies ourselves we could see that HTTPS performance was not up to scratch.
Puzzled as to why our monitoring had not picked this up, we worked with our supplier to split out monitoring for customers using this feature, and could indeed see that response times had been creeping up over time.
Our monitoring also allows us to split out the average time our app spends executing code from the network and response time back to the user, and we could see that application time for HTTPS customers remained consistent with that of customers using ordinary HTTP over this period.
This allowed us to focus on the infrastructure differences in how an HTTPS customer is set up, and we were able to narrow the fault down to the way Virtual Hosts (vhosts) are configured in our web servers.
Previously there was a single vhost for all HTTPS traffic and every new customer added to the feature added a new rule to that vhost. This is fine for a handful of customers, but as the ruleset grows each new rule adds a tiny bit of time to every HTTPS request.
The issue was solved by changing our configuration to use a vhost per customer, with one rule per vhost. This way, adding new customers never affects existing ones.
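The performance difference boils down to how each request is routed. The sketch below models the two setups in Python rather than real web-server configuration, with invented hostnames: the shared vhost behaves like a linear scan over one rule per customer, while per-customer vhosts behave like a single lookup on the Host header.

```python
# One rule per customer, all in a single shared vhost (the old setup).
shared_vhost_rules = [
    ("customer-a.example.com", "tenant-a"),
    ("customer-b.example.com", "tenant-b"),
    # ...a new rule was appended here for every new HTTPS customer...
]

def route_shared_vhost(host):
    """Old setup: every request walks the whole rule list, so each new
    customer adds a little time to every HTTPS request (O(n) customers)."""
    for pattern, tenant in shared_vhost_rules:
        if host == pattern:
            return tenant
    return None

# One vhost per customer (the new setup), effectively a keyed lookup.
per_customer_vhosts = dict(shared_vhost_rules)

def route_per_customer(host):
    """New setup: one rule per vhost, selected directly by hostname,
    so adding customers never slows down existing ones (O(1) lookup)."""
    return per_customer_vhosts.get(host)
```

Both functions return the same answers; only the cost per request differs as the customer count grows, which matches the slow creep we saw in response times.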
The solution was extremely simple in the end, but we could have burnt a lot of time here had we assumed the problem was in our application code. Our monitoring solution helped us drill down quickly on our web server configuration instead and implement a quick fix.
4. Release process causing problems
All of our software is delivered as a service, with Talis managing the rollout of new releases. Over the years we’ve done hundreds of zero downtime releases of our software.
Over the last few weeks we’ve been making some significant changes under the hood to support the upcoming general release of live reports and list reviews, and in general adding more capacity to the infrastructure as the system continues to grow. We’ve had a number of releases not go as smoothly as we’d planned, and although no outages or downtime were recorded as a result, they have caused temporary glitches for some users.
When operational issues such as this occur, we have an incident reporting process whereby a coordinator collates a timeline of events during the incident, which is then shared with the whole team for comment. We use techniques such as 5 whys to get to the root cause of why the issue happened, and what we could put in place to prevent a recurrence.
The idea is never to apportion blame. Humans will sometimes make mistakes, so rather than waste energy on blame, the focus is on how broken processes – or the lack of a process – led to specific faults.
We also see human error as potential for automating tasks – engineering out the humans wherever possible!
Specifically in relation to releases, we used the results of our analysis to completely rewrite and redocument our release process, introducing canary releases and automatable smoke tests.
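The shape of an automatable smoke test is simple: hit a few key checks against the newly released canary and fail fast if any are unhealthy. The sketch below is a generic illustration, not our actual test suite – the check function is injected so the logic stays testable without a live system.

```python
def smoke_test(paths, fetch):
    """Run each health check against a canary release.

    `paths` is a list of endpoints to probe; `fetch` is any callable that
    returns an HTTP status code for a path (in production this would make
    a real request to the canary host). Returns the list of failing paths;
    an empty list means the canary looks healthy and the rollout can proceed.
    """
    return [path for path in paths if fetch(path) != 200]
```

Wired into the release process, a non-empty result halts the rollout before the release reaches the rest of the fleet.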
We also made changes about what constitutes an ordinary release (which can be made with no downtime) and which releases are riskier and therefore need a scheduled maintenance slot.
As software products grow in usage and complexity it is inevitable that they hit performance snags along the way.
It’s natural to want to jump right in and fix things immediately when users are affected, but care is needed to ensure you aren’t solving one problem by creating another, especially for a system of our size, which serves well over a million users and an increasing roster of universities across the globe.
This is why we believe in an evidence-based approach where we understand exactly why problems occur, and where we can test out theories and measure their effects before making major changes.
As well as investing significantly in technology to help us, we are also continually looking at ourselves and how we operate as a team, using every incident report as an opportunity for learning and improvement.