Achieving operational scalability in software for private markets

March 8, 2023

Paul Foley

Chief Technology Officer

Over the years I’ve been involved with a number of companies: some start-ups, some turnarounds and others trying to scale up. All of these companies have a number of things in common.

So today I’d like to touch on one aspect that is common to every business and is a key success factor no matter what stage of its lifecycle the company is in: operational scalability and its effects on UX.

If someone mentions UX, your first thought is probably UI and navigating an application in the smallest number of clicks – but behind the scenes there’s another element to providing a great user experience.

The UI stuff is great, but it is not what those of us in the IT department are thinking about when we talk about the experience of the user.

When I talk about the user experience I’m thinking about the system’s performance, its availability, how it performs under load and how much downtime is required any time we need to do anything to it.

In short, I’m talking about the one thing that most management teams don’t appreciate until it’s too late – non-functional requirements and their effect on operational scalability.

Operational scalability in this case relates to the performance of your software and infrastructure, and both feed into your ability to retain clients and staff.

So where do most companies go wrong?

In my experience there are two areas where companies tend to underestimate or completely neglect their own interests: system resilience and system performance.

System performance

Perhaps the best phrase to start with, even though it’s common sense, is “A proof of concept is not a product”.

I’ve seen a number of start-ups and blue chips alike that have developed a fantastic proof of concept. It looked amazing and for a small number of users (typically located in the same office) it worked perfectly. The management teams hailed the project a complete success and then opened the floodgates to the user community.

At which point the support staff were overwhelmed, developers stopped answering their phones and the management couldn’t work out why they were suddenly seeing so many angry emails.

The point that they overlooked was that their testing did not include load testing and only used a small number of well-orchestrated technical users performing tests. I’ve seen multi-million-dollar systems that could only support a handful of users at any time for exactly this reason.

This is a very simple trap to avoid. At design time the product management team should specify the KPIs that the system must deliver (for example, how many transactions per second the system can process and how many simultaneous users it can support), and the IT team should then verify that the delivered solution meets these KPIs. These are often referred to as part of the non-functional requirements of a system.
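
As a rough illustration of verifying such KPIs, the Python sketch below drives a number of simultaneous simulated users and compares measured throughput and 95th-percentile latency against targets. Everything named here (KpiTargets, LoadResult, handle_request, run_load_test, kpi_failures) is hypothetical, and handle_request is only a stand-in for a real transaction; a real test would call the actual system, usually through a dedicated load-testing tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class KpiTargets:
    # Hypothetical targets a product team might specify at design time.
    min_transactions_per_second: float
    max_p95_latency_seconds: float


@dataclass
class LoadResult:
    transactions_per_second: float
    p95_latency_seconds: float


def handle_request() -> None:
    # Stand-in for one real transaction against the system under test.
    time.sleep(0.001)


def timed_call(_: int) -> float:
    # Measure how long a single transaction takes.
    start = time.perf_counter()
    handle_request()
    return time.perf_counter() - start


def run_load_test(simultaneous_users: int, requests_per_user: int) -> LoadResult:
    # One worker thread per simulated user, all sharing the request queue.
    total = simultaneous_users * requests_per_user
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=simultaneous_users) as pool:
        latencies = sorted(pool.map(timed_call, range(total)))
    elapsed = time.perf_counter() - started
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    return LoadResult(total / elapsed, p95)


def kpi_failures(result: LoadResult, targets: KpiTargets) -> list[str]:
    # Return a description of every KPI that the measured result misses.
    failures = []
    if result.transactions_per_second < targets.min_transactions_per_second:
        failures.append("throughput below target")
    if result.p95_latency_seconds > targets.max_p95_latency_seconds:
        failures.append("p95 latency above target")
    return failures
```

A check like kpi_failures(run_load_test(50, 20), targets) could run as part of a release pipeline, so that a build which misses its agreed KPIs is caught before the floodgates open rather than after.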

I made the point about a proof of concept not being a product because most POCs are built to look great whilst being based on a minimum viable codebase, not a well-designed, hardened codebase.

System resilience

When I talk about system resilience, I’m typically thinking of two things. The first is a system’s ability to process data whilst under stress, to maintain referential integrity of data/processing and to maintain operational capabilities. The second is the system’s ability to recover from the stress event, hopefully in a manner that insulates the client from the recovery mechanism (and any side effects).

To achieve this, we should consider the following items:

Recovery - can a specific component or service be restarted without affecting the entire platform? Can it restart automatically?

Scaling - can the infrastructure automatically assign more (or fewer) resources to a component that is under stress? Does the infrastructure have a mechanism to detect these events?

Monitoring and Alerting - This is the really important one. Do you know what your system is doing (both the infrastructure and the components)? Do you know that there are problems BEFORE the client knows?

Note: You should be able to answer “Yes” to all three of these points.
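
To make the recovery and alerting questions concrete, here is a minimal Python sketch of a supervisor that restarts only the failing component and records an alert either way. Component, Supervisor and the alert wording are all hypothetical; in practice the health flag would come from a real probe and this job is normally handled by the platform's own tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    # Hypothetical service wrapper; a real one would probe a health endpoint.
    name: str
    healthy: bool = True
    restartable: bool = True
    restarts: int = 0

    def restart(self) -> None:
        # Simulated restart: it succeeds only if the component is restartable.
        self.restarts += 1
        self.healthy = self.restartable


@dataclass
class Supervisor:
    components: list[Component]
    alerts: list[str] = field(default_factory=list)

    def check_once(self) -> None:
        for component in self.components:
            if component.healthy:
                continue
            # Recovery: restart the failing component only, not the platform.
            component.restart()
            # Alerting: record the event so the team knows before the client.
            if component.healthy:
                self.alerts.append(f"{component.name}: restarted automatically")
            else:
                self.alerts.append(f"{component.name}: restart failed, page on-call")
```

The design point is that healthy components are never touched, and every recovery attempt leaves a trace that the team can see, whether it succeeded or not.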

There are a number of solutions to these items, ranging from elastic pools for scaling in cloud environments through to containerised component deployments and Kubernetes to enable scaling of the software components and automated recovery. Again, for monitoring and alerting there is a myriad of solutions, including the built-in solutions of Azure/AWS.
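
For a sense of what an automated scaling decision looks like, the Python sketch below applies a proportional rule of the kind Kubernetes documents for its Horizontal Pod Autoscaler: scale the replica count by the ratio of observed load to target load, then clamp it to configured bounds. The function name, the CPU-percentage metric and the default bounds are assumptions for illustration, not a description of any particular deployment.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_percent: float,
    target_cpu_percent: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    # Proportional rule: grow or shrink in line with observed vs. target load.
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    # Clamp so a spike cannot request unbounded resources and a lull
    # cannot scale the component away entirely.
    return max(min_replicas, min(max_replicas, raw))
```

With four replicas running at 90% CPU against a 60% target, the rule asks for six replicas; at 30% CPU it asks for two.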

I mentioned that operational scalability will influence not only an organisation’s ability to retain clients, but also staff. This is because systems that have not been designed with operational scalability in mind will typically require a lot of unplanned effort from the IT team. The team delivers less new functionality (because they’re spending more time diagnosing/fixing things manually), is required to work overtime to fix problems, and ends up with conflict between the development and infrastructure teams, not to mention the management pressure to deliver more. Why would anyone stick around for that?

So, if you take nothing else away from this article, consider this: as part of your overall quality strategy, it is in your own self-interest to ensure that you deliver a performant, resilient system, and the best way to do this is to consider non-functional requirements (and their impact on user experience) at design time, not once the emails start arriving.

Paul Foley has worked in financial services for over 20 years, including trading, wealth management, private banking and institutional crypto. He joined qashqade in September 2022 as CTO and is responsible for product development and the IT function of qashqade. Paul holds a BSc from Derby University.
