Category Archives: Software Development

Performance Test Rating Criteria

TL;DR: Load and performance testing produces a vast amount of data. This data has to be interpreted and communicated. Because not every interested party speaks the same language, Xceptance developed a performance test rating and grading system. It evaluates response time, stability, and predictability and transforms three factors into a simple and communicable form. While doing that, it does not compromise on quality. It has been successfully used in more than 400 projects.

The Challenge

Load and performance testing is a key activity for making an online business successful. It validates that traffic and conversion expectations can be fulfilled. This of course applies to all kinds of Internet-based applications. Basically, as soon as there are expectations in terms of stability and performance, a test is mandatory to validate these. Expectations are usually set as requirements by different organizational groups such as sales, product management, engineering, and development teams.

Every group has a different understanding when it comes to results, goals, and success criteria. Some might be more concerned with the business impact, others are looking for technical implications of design decisions, and some just want to improve performance.

The group that is tasked with the evaluation of the requirements is faced with a very wide range of success definitions. In addition, it has to explain its technical measurements to all participating parties so that each party easily understands the state of testing.

Engineers rather look for detailed metrics including but not limited to the system behavior under test, while business-centric stakeholders just expect a clear yes or no. But performance testing typically does not deliver a clear result.

How can one reach all target groups without causing too much extra work to cater to all individual needs?

The Rating System

Xceptance developed a rating system that uses an American education system-like grading from A to F. The grades A to C symbolize a pass, while D and F are considered a fail. A grade B stands for an assumed average across similar customers and projects. It also stands for a good result. This leaves room in both directions to over or underperform.

Because performance results are not just shaped by response times, three factors are taken into account:

  • Response Times
  • Errors
  • Predictability

Before continuing a detailed discussion of the factors, this table explains each grade in one sentence. For the more ambitious customers, the A+ stretch goal was added.

Let’s talk about our three factors in detail now.

Response Times

To clarify one thing first, this is about server-side performance. It does not take the client-side rendering including the loading of any content or JavaScript execution into account. For this, Google Web Vitals provide an excellent scoring system. In addition, when you evaluate API performance, you might want to refine your performance expectations.

A response time typically defines the time from initiating a request to a server until the full response has been returned to the requestor. It is likely the most used number to describe performance.

Server response times are validated using percentiles. This number indicates what response time a certain amount of users experience at maximum. For the response time grading, the P95 value is used. That means that 95% of requests have been finished within that time or faster. 

One of the reasons for going with the P95 is that the user of provided infrastructure and software has often only limited influence on the behavior of the full stack. Hence there will be peaks and outliers which cannot be controlled nor diagnosed.

One can use the average, but it is a very imprecise number. It will either move too much or too little in regard to outliers. It just does not show what most of the users will truly see. An average user will never post something negative on social media, but the last 15% of the response time range will. So the rating cares more about the negative impression than the positive. This sadly ignores the last 5% and that is where projects become expensive. It is really hard to convince all stakeholders to use the P95 instead of the average, trying to establish the P99 or higher is nearly impossible. 

The response times are grouped by activities because the user usually is more impatient at the beginning of its user journey. This data is based on years of measurements and ensures that goals are achievable but also ambitious enough. They all might heavily vary on your platform and stack. So these are not set in stone and these examples might only work for commerce.


The fastest response times are of no help if the response is incorrect. So correctness is key to performance testing. One might just say, every response code larger than 400 is a failure, but it is not that easy. Hence this factor distinguishes between two different types of errors: technical and functional errors. 

Examples of functional errors are, not fully set up products, products that cannot be bought together, or an empty search result. If a response was not received or the response code indicates a problem, it is considered a technical error. Or in other words: If the user is properly notified about the problem, it is mostly a functional issue. If the state or messaging is undefined, it is mostly technical in origin. 

A performance test has to ensure that technical errors are clearly highlighted because they might be caused by performance testing due to volume or scaling. Additionally, a performance test has to ensure functional correctness, because if the order cannot be placed and this is measured without validating the result, the test is just void.

To assign a grade to the error factor, this is verified:

  • How many percent of the visits have been affected by an error?
  • Do the errors occur in patterns or clusters?
  • Are there any main features affected such as checkout or sign in? 

Since barely any test run is error free when reaching a certain complexity, a low error rate might still be acceptable, even for an A rating. 

The disadvantage of this approach is that errors become an accepted fact.


In a perfect world, response times would be the same all day long. But that is impossible to achieve in a complex IT landscape, hence it is essential to minimize the noise. That is what this rating tries to capture. It introduces a predictability grade. Because the evaluation of noise is a complex process, Xceptance decided to retract to a metric anyone can grasp.

Predictability is defined by the business impact on the end user. It uses response times exceeding a certain value and the occurrence of technical errors (mainly non-recoverable errors such as response codes 500 or no response at all). This should give the merchant an idea of how many visits might have been affected by problems or slowness. Furthermore, being affected means that a visit carries the risk of a loss in revenue or reputation.

The threshold of 10 seconds is based on the user perception model published in the Google RAIL Model. This model applies to the total page loading time. The factor predictability uses the 10-second threshold for the runtime of a request. This makes it even softer than the value Google suggests, because the additional loading time of images, CSS, JavaScript is ignored.

Apart from the affected visits, this factor also considers response time patterns, increasing response time over time, and any sudden runtime changes.

To determine the grade these two questions are asked:

  • How many visits are affected (see formula above)?
  • Are there any patterns in response times such as increases, waves, or repeating spikes?


This rating system provides comparable and understandable results and caters to the needs of many stakeholders. Xceptance rolled it out to almost all load and performance tests and it proves to be effective and reliable. Some customers even arrive now with a goal of a certain rating to improve on the previous year’s results.

Xceptance has put that performance rating guide under the Creative Commons license – CC BY-SA 4.0. Feel free to use it for your everyday work, improve on it, and please let us know any feedback. We certainly appreciate it.

You can also use this slidedeck for communication and documentation purposes –

Java Training Sessions

Today we are going to publish four of our Java training sessions so you can use the material and benefit from it.

Let’s get started with four direct links to extensive material that might help you to understand Java or code quality better or just help you to reflect on topics you already know.

  • The Java Memory Model: Why you have to know the JMM to understand Java and write stable, correct, and fast code.
  • Java Memory Management: Know more about the size of objects and how Java does garbage collection.
  • High Performance Java: All about the smart Java internals that turn your code into fast code and how you can leverage that knowledge.
  • High Quality Code: The anatomy of high quality code that supports longevity, cross-team usage, and correctness. This is not just about Java, this is about good code in general.

Show a little patience when loading the training, these are all large reveal.js based slide sets. Use the arrow keys or space to navigate. Because the slide sets are designed to be interactive sessions, in many cases, not the entire slide context is revealed at once but block by block.

We publish these training sessions because they are also based on openly shared material, it greatly helped us to advance and understand, as well of course advertise a little what Xceptance might be able to do for you.

We will release more of our material in the next weeks and month, so everybody can browse and learn. This won’t be limited to Java and also cover material about approaching load testing, how to come up with test cases, and more about the modern web and its quality and performance challenges. Of course there will be more Java material too. You can get a glimpse of it when you just follow this link and page through the slides: The Infinite Java Training. Please remember, not all material is complete yet.

If you like the material and you need an audio track aka a real presentation, please talk to us. If you see other training needs in the area of quality assurance, testing, and Java, please contact to us.

More to come.

Thuringia’s Open-Source Prize for XLT

Wolfgang Tiefensee, Thuringia’s Secretary of Commerce, in conjunction with the board of directors of the IT industry network ITNet Thuringia, awarded the first Thueringen Open-Source Prize to three companies, all of them software companies based in Jena: TRITUM, Xceptance and GraphDefined.

Open Source Prize Title Picture Second Place

It is an honor for Xceptance to be the second-place winner of this competition. This result clearly demonstrates that open source as a component of commercial products can be a clear competitive advantage. XLT incorporates a number of open-source projects, including Apache HttpClient, Jetty, HtmlUnit, JUnit, and the Apache Commons libraries. As part of developing XLT, Xceptance is involved in testing and providing feedback for these projects, thus giving back to the open-source community.

While XLT is itself not open source, Xceptance does provide the software free of charge and with virtually no usage restrictions, so for most applications there is no noticeable difference to open-source software.

Use XLT with Sauce Labs and BrowserStack

Sauce Labs and BrowserStack – What Are They and Why Use Them?

This approach still work fine, but we came up with a much better one. Head over to GitHub and see our Multi-Browser-TestSuite for XLT. It will make multi browser testing a breeze. By the way, all the code is licensed under the MIT license, so absolute flexibility for you.

Sauce Labs and BrowserStack allow you to run automated test cases on different browsers and operating systems. Both provide more than 200 mobile and desktop browsers on different operating systems. The benefit? You can focus on coding instead of having to maintain different devices. You can easily run your test cases written on iOS on an Internet Explorer without actually buying a Windows device; and last not least, you don’t need to worry about drivers or maintenance.

By the way, Internet Explorer even seems to run faster at Sauce Labs than on a desktop machine. Also note that Sauce Labs supports Maven builds.
Continue reading Use XLT with Sauce Labs and BrowserStack

Tutorial: Git – The Incomplete Introduction

Software Testing is part of software development. So you need a form of revision control for your source aka test code, and documents. You also need it to be able to review code, compare the history of code… or maybe simply to help others to master it.

We recently started our migration from Subversion to Git. Not because we have been unsatisfied with SVN, mostly because we want to use what our customers use. Additionally we want to profit from the different functionality Git offers, such as local commits and cheap branching.

But Git is different and just changing the tool does not change anything, it might even turns things worse. Because you cannot run Git like SVN. Well, you can, but that still requires you to know the basics of Git to understand what it will do to your work and how a typical workflow looks like. The commands are different too.

So we created this tutorial to get used to Git, understand, and learn it.
Continue reading Tutorial: Git – The Incomplete Introduction

HPQC and XLT – Integration Example

You have to work with HP Quality Center (HPQC), but you don’t want to execute all the test cases manually. You automated some tests using XLT Script Developer and like the outcome. You want to use the Script developer much more but you face one last problem: You still have to enter the test results manually into HPQC. This renders some of the test automation advantages useless.

The following example can mitigate that problem. HPQC offers an API called Quality Center Open Test Architecture API (OTA API).
Using this interface, you can set test results automatically.
Continue reading HPQC and XLT – Integration Example

Spurious wakeup – the rare event

After hunting for quite some time for a strange application behavior, I finally found the reason.

The Problem

The Java application was behaving strangely in 4 out of 10 runs. It did not process all data available and assumed that the data input already ended. The application features several producer-consumer patterns, where one thread offers preprocessed data to the next one, passing it into a buffer where the next thread reads it from.

The consumer or producer fall into a wait state in case no data is available or the buffer is full. In case of a state change, the active threads notifies all waiting threads about the new data or the fact that all data is consumed.

On 2-core and 8-core machines, the application was running fine but when we moved it to 24-cores, it suddenly started to act in an unpredictable manner.
Continue reading Spurious wakeup – the rare event

One digit version numbers only, please!

Just read about a nice small software problem at Opera. Their latest browser is version 10, but they couldn’t continue to use the version number in the user agent string, because some web sites try to identify the agent version and fail with 2 digit version numbers. Seems to be similar to the famous Y2K problem, but now it is a BVN problem – a browser version number problem.

“…It appears that a considerable amount of browser sniffing scripts are not quite ready for this change to double digits, as they detect only the first digit of the user agent string: in such a scenario, Opera 10 is interpreted as Opera 1. This results in sites mistakenly identifying Opera 10 as an unsupported browser, thereby breaking server, as well as client-side scripts…”

Read more at Dev.Opera.