Load, Stress, Performance Test Terms, Deliverables, Profiles and Reports

This page presents the format of a sample performance profile. It is a companion to the pages on the Performance Testing Plan and covers each aspect of an application's performance, based on statistics and graphs created by load testing tools such as LoadRunner executing scripts.


Introduction: Our Flow of Deliverables

The project objective is for the project team to present Recommendations to address (in an actionable manner) the Concerns that various stakeholders have about IT support of specific Business Processes.

Stakeholders' concerns are clarified into a set of questions that can be answered based on factual metrics that signal whether pre-determined business goals and technical requirements are being met.

Action recommendations are based on statistical Conclusions arrived at through analysis of results generated from several types of performance testing.

Our rigorous approach applies procedures guided by a set of technical objectives and budgets.

Flow of Information in a load profiling project

Definition of Terms

Results from a test run (such as in these statistical reports and graphs generated by LoadRunner) are the values obtained from measuring the impact of a specific set of run conditions.

Conclusions (such as these) are subjective decisions (a proposition or claim) reached after (hopefully) thoughtful consideration of the facts drawn from evidence provided.

In formal statistics, a conclusion evaluates the accuracy of a prior hypothesis, which is either accepted (confirmed) or rejected based on the outcome of experiments.

Conclusions are presented organized by the questions and by the forms/types of performance tests.

A finding is a determination about the scope, validity, and reliability of observed facts (data). Example:

"The GUI Response timing of an average 1.3 seconds to obtain a response to a valid login request reflects what might be typically experienced by normal production users."

This statement limits its scope to:

valid requests from "positive" test cases, not invalid requests from "negative" test cases intended to end in errors.
login requests only, not any other type of request.
typical loads, not heavy loads with a lot of users.
normal users, not users processing an extreme amount of data.
users in the open production environment, not in a closed testing/development environment.

Findings supply the premises (the "truths" or evidence) that form the basis for making conclusions.

All this is the path to a well-reasoned approach to the management of Performance, Scalability, Reliability (PSR).


Concerns, Questions, Metrics, and Goals

Here are the most common concerns. Each may have different importance depending on the organizational context.

# Concern Questions Metric Goal

I. User productivity

What is the fastest response time that users can expect from each dialog/screen/function of the application?
Metric: User Response Time. Goal: 6 sec.

How fast does the application detect, report, and recover from various user errors?
Metric: User Error Recovery Time. Goal: 2 sec.

How frequently do spikes in response time occur?
Metric: Mean Time to spike in response time. Goal: none.

How much degradation in response time can be expected as the application processes larger amounts of data?
Metric: Data Size-Response Time Curve. Goal: see chart.

How much degradation in response time can be expected as the application gets busier (processes more users/transactions simultaneously)?
Metric: Load-Response Time Curve. Goal: see chart.

II. Operational efficiency (Maintainability): Stability of the configuration
Planned vs. Unplanned Effort

How long can the application run before needing manual intervention?
Metric: Mean Time Between Failure (MTBF) or Intervention. Goal: 1 week.

How quickly can configuration changes be made?
Metric: Mean Time to Repair (MTTR). Goal: 2 hours.

How long does it take to back up disk images off servers?
Metric: Backup Speed. Goal: -

How long does it take to restore disk images on servers?
Metric: Restore Speed. Goal: 1 hour.

How long does it take to restart servers?
Metric: Restart Speed. Goal: 5 minutes.

How quickly can each component fail over to redundant resources?
Metric: Failover Speed. Goal: 5 seconds.

How quickly can the system fail back from redundant resources?
Metric: Failback Speed. Goal: 5 minutes.

III. Stress on the common infrastructure

How many bytes are sent back and forth between client and server?

How many resources (servers, memory, disk space, connections, file handles, etc.) does each part of the system need/consume? Are proportionately greater resources consumed at larger loads?
Metric: Resource Consumption (Usage) Rate. Goal: -

At what level of resource consumption should operations be alerted for manual intervention?
Metric: Alert Level. Goal: -

IV. Capacity of the configuration
Throughput at various Simultaneous User Loads

At what load does the application process at unacceptable response times (e.g., over 4 seconds)?
Metric: Point of Throughput Degradation. Goal: -

At what load does the system reject transactions?
Metric: Point of Throughput Rejection. Goal: -

At what overwhelming level of load do servers shut down?
Metric: Point of Throughput Failure. Goal: -

V. Resource Utilization
(Actions which expand capacity of existing machines)

Where is the bottleneck (which component's response degrades most as load increases)?
Metric: Component Causing Throughput Restriction. Goal: database.

What is the impact of changes / tuning options (such as application software versions, utilities, OS settings, JVM settings, etc.)?
Metric: Gain from tuning.

VI. Capacity for growth

How well do provisions for balancing load allocate work among machines?
Metric: Effectiveness of Load Balancing. Goal: 50%.

How well can individual machines "scale vertically" when a server is upgraded with higher capacity components?
Metric: Gain from upgrading components. Goal: 10%.

How well does the system "scale horizontally" when more machines are added?
Metric: Gain from changing config. Goal: 10%.

How much longer can business volume grow before reaching a point of degradation?
Metric: Reserve Capacity. Goal: 30%.

VII. Extent and Efficiency of Testing Effort (Testability)

How many features/lines (or function points) should be/are in scripts created to conduct load test runs?
Metric: Lines to test. Goal: 2,000.

How long should/do load test runs take?
Metric: Hours to run test, write analysis report. Goal: 3 hours.

How many runs should/does it take?
Metric: Test Runs. Goal: 3 each.

This format is based partly on the Goal/Question/Metric (GQM) method, described in The Goal/Question/Metric Method: A Practical Guide for Quality Improvement of Software Development (McGraw-Hill, 1999) by Rini van Solingen and Egon Berghout. See www.gqm.nl

Requirements

An example of a complete non-functional performance requirement is:

500 of 1,000 logged-in users, who pause an average of 20 seconds between a mix of requests (designated in table below) containing between 20 and 30 line items, obtain a completion response with no errors on IE6, IE7, and Firefox browsers in under 6 seconds 95 percent of the time. This measurement accesses the 1GB corporate local area network (LAN) during working hours (7 A.M. to 7 P.M. EST) with a normal load of background processes running throughout the period measured.

This statement answers basic questions:

How many users can the system handle?
What is the system's maximum throughput?
What are the sensitive hardware components?
How many servers are required at each tier?
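
A requirement phrased like the example above ("under 6 seconds 95 percent of the time") can be checked mechanically once response times are exported from a run. The following is a minimal sketch, assuming the measurements are available as a list of seconds; the function name and the sample values are hypothetical.

```python
def meets_requirement(response_times_s, limit_s=6.0, required_fraction=0.95):
    """Return True if at least required_fraction of responses finished under limit_s."""
    if not response_times_s:
        raise ValueError("no measurements supplied")
    within = sum(1 for t in response_times_s if t < limit_s)
    return within / len(response_times_s) >= required_fraction


# Hypothetical sample: 19 of 20 responses (95%) complete in under 6 seconds.
sample = [1.3] * 19 + [7.2]
print(meets_requirement(sample))
```

The other conditions in the requirement (user mix, think time, browsers, network, time of day) describe how the run is set up, so they belong in the run configuration rather than in this check.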

Project Technical Objectives to Address Concerns

The project objectives addressing concerns which prompted this project are:

As a practical matter, bugs in Configuration, Installation, Security, and Failover/Recovery (Robustness) often need to be fixed before conducting Performance, Load, and Stress testing:

# Concern (Questions): Project Technical Objective / Type of Testing

I. User productivity (questions a-e)
Conduct speed tests to estimate the responsiveness of each user action. This identifies opportunities for application and configuration tuning.

II. Operational efficiency: Stability of the configuration (questions f-l)
Conduct longevity tests by running a low or normal level of work over a long period of time. This identifies the extent of variation and spikes.
Conduct failover tests by stopping various processes while running various levels of load. Such actions should result in redundant components taking over for the primary nodes. This also includes failback to make sure that work returns to normal after components come back online.

III. Stress on the common database machine (questions m-n)
Measure the number of bytes between client and server.
Execute the most resource-intensive business processes to obtain database machine CPU utilization metrics at various levels of application load.

IV. Capacity of the configuration (questions o-q)
Conduct stress tests to measure user response time (and errors) at various levels of application load. This determines the number of servers needed, which impacts product pricing. From this, extrapolate the lead time and trigger point for upgrades.
Continue overload tests running after a server recycles or shuts down to see if the application can automatically recover after being flooded.

V. Resource Utilization (questions r-s)
Repeat Stress and Longevity Tests to determine the impact of various tuning options (such as application software versions, utilities, OS settings, JVM settings, etc.).

VI. Capacity for growth (questions t-w)
Conduct Scalability tests by repeating tests for each configuration.
Conduct volume tests by running a large amount of data to ensure that the system can accommodate it at acceptable speed.

Project Budget Performance

ID Task Name BCWS BCWP ACWP SV CV EAC BAC VAC


This Earned Value Report was created using Microsoft Project applying general project management practices for performance/capacity planning.

The baseline developed from project planning efforts provides an "early warning" indicator for the percentage completion progress of the effort.
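
The variance columns in the report header follow from the standard earned value definitions, which can be sketched as below. The input figures are hypothetical, and the EAC formula shown (BAC divided by the cost performance index) is only one common estimating convention, assumed here for illustration; Microsoft Project may be configured differently.

```python
def earned_value_metrics(bcws, bcwp, acwp, bac):
    """Derive the variance columns of an earned value report from its inputs."""
    sv = bcwp - bcws   # schedule variance: value earned minus value planned
    cv = bcwp - acwp   # cost variance: value earned minus actual cost
    cpi = bcwp / acwp  # cost performance index
    eac = bac / cpi    # estimate at completion (assumes the current CPI persists)
    vac = bac - eac    # variance at completion
    return {"SV": sv, "CV": cv, "EAC": eac, "VAC": vac}


# Hypothetical task: 100 planned, 80 earned, 90 spent, against a budget of 200.
metrics = earned_value_metrics(bcws=100, bcwp=80, acwp=90, bac=200)
print(metrics)
```

Negative SV and CV values are the "early warning": the task is behind schedule and over cost relative to the baseline.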


Issues (Discoveries and Recommendations)

This table presents the concerns which initiated the project together with the subsequent example observations and discoveries found during load profiling and analysis.

Concern / Possible Discovery / Description and Analysis / Recommendation

VII. Extent and Efficiency of Testing Effort (Testability)
Discovery: In HTML, no differentiation of rows for counting among different tables.
Analysis: In order to obtain a count of items in different tables on the same page, a unique identifier is needed for each type of row.
Recommendation: Add a unique CSS class= attribute to each type of row. This is usually a design requirement.

I. User Productivity
Discovery: File returned to client is more than 500,000 bytes.
Analysis: This guarantees long response times and potential timeouts.
Recommendation: Pre-cache files in smaller pages hidden in sign-up pages or download in background.

Discovery: Use of UTF-8 ContentType for English-only pages.
Analysis: Additional time is required to process vs. ISO-8859-1.
Recommendation: Specify UTF-8 only for pages which need it.

Discovery: No indication that system is working during long processes.
Analysis: Users are likely to abandon the session, click refresh, or take other actions which cause even more load.
Recommendation: Show a "searching... please wait" screen for responses known to be over 5 seconds.

Discovery: When the server is overloaded, users see no screen or technical default text.
Analysis: A cryptic HTTP "500" error is shown when servers are too busy to respond.
Recommendation: Show a "Busy ... Please Try Again Later" screen to users who are not allowed to login due to server overload.

Discovery: The first user of the day experiences long response times.
Analysis: Servers wait until users request specific transactions before loading them into memory, a task which may take several minutes.
Recommendation: When server services start, automatically load programs into memory by configuration settings or invoking fake users.

Discovery: Users must make the same filtering selections repeatedly.
Analysis: Values to filter data specified by each user are not presented again. Retrieving data that users discard consumes CPU, memory, network, and other resources.
Recommendation: Filter out data that users usually don't want.

III. Stress on the common database machine
Discovery: Server error after 5 minutes.
Analysis: JVM diagnostics graphs showed that memory peaks at 250 MB. This is the default value.
Recommendation: Specify -Xmx2000m among JVM startup parameters.

Discovery: Server error after 15 minutes.
Analysis: Parallel graphs of diagnostics showed that the number of WebLogic sessions flattened out at 250. This is the default number. Since the timeout is 20 minutes, runs require 35 sessions per user per minute.
Recommendation: Specify the maximum number of sessions in the config.xml file.

Discovery: High disk utilization.
Analysis: 10GB of disk space is consumed per hour of peak load.
Recommendation: In production system simulations, use "Error" level logging.

Discovery: Maximum app loads did not overload the DB server.
Analysis: The major concern of this project was the impact on the Oracle machine. Runs at the largest application volume increased CPU utilization by no more than 25% with AP transactions, which had the most impact on the server.
Recommendation: Identify and test for the total possible load on the DB running all apps at possible peak loads.

II. Operational Efficiency: Stability of the configuration (Readiness of the app. for production)
Discovery: An image file was not found on page "Xyz".
Analysis: Microsoft browsers automatically request the favicon.ico file, which generates an error if it's not in the website's root folder.
Recommendation: Workaround: Script the load test to ignore the "404" error. Root cause fix: Provide the file with the name expected by the app code or change the code.

II. Efficient Resource Utilization
Discovery: Spikes in performance.
Analysis: Longevity tests confirm that spikes in response time were eliminated after changing the JVM run-time settings in the server start-up to specify a) more memory to permgen, b) availability of multiple processors, c) incremental garbage collection.

Discovery: Server shutdown during overnight runs.
Analysis: The server shut down near the end of Longevity tests because it ran out of file handles. When the OS was set up with the maximum rather than the default number of file handles, the app completed longevity tests. An additional temporary workaround is to recycle each process once a day.
Recommendation: Workaround: Configure the OS with more file handles/descriptors. Root cause fix: Change app code to explicitly close files.

To better manage follow-up, action items may be entered into a "defect" tracking system or task/project management system.

Business Processes

Business Process   # Steps   Iteration Time1   TPM /User   Peak# Users   Max# Users

0. LL Login / Logout 2 12 s 0.060 300 500

1. AP Accounts Payable [6 lines] 23 8.5 m 0.160 10 15

2. JE GL Journal Entry [14 lines] 6 10 m 0.001 10 20

3. RC GL Report Creation [1 acct] 5 24 s 0.160 25 50

4. RR Report Retrieval 3 12 s 0.260 50 100

5. EV Employee Expense Creation [4 lines] 2 10 m 0.360 100 400

Combined 41 42 m 1.001 300 500

# Steps provides for each business process a count of its user dialogs: the number of "round trips" to the server after the user clicks a submit button or a link. The link provided with the number leads to a list of the dialogs and the names of transaction measurements.

Iteration Time1 is the total amount of time needed to complete all steps of the business process. (This can be obtained from VuGen during load script development).

TPM /User (Transactions Per Minute per User) is the TPS (Transactions Per Second) multiplied by 60.

Peak# Users is the peak (largest) number of users that may perform that process all at once, such as (in the case of login) each work-day morning, and (in the case of business processes) around each accounting period-end.

Max# Users is the maximum number of users that can possibly use the specific process all at one time.
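
The column definitions above can be exercised with a short sketch. The steps, TPM /User, and Peak# Users figures are copied from the table; the helper names are hypothetical, and the peak calculation assumes every peak user runs the process at the listed per-user rate, which is a simplification.

```python
# (code, steps, TPM per user, peak users) copied from the Business Processes table.
PROCESSES = [
    ("LL", 2, 0.060, 300),
    ("AP", 23, 0.160, 10),
    ("JE", 6, 0.001, 10),
    ("RC", 5, 0.160, 25),
    ("RR", 3, 0.260, 50),
    ("EV", 2, 0.360, 100),
]


def tps_to_tpm(tps):
    """Convert transactions per second to transactions per minute."""
    return tps * 60


def peak_tpm(tpm_per_user, peak_users):
    """Transactions per minute offered at peak (assumes all peak users are active)."""
    return tpm_per_user * peak_users


# Totals that should agree with the "Combined" row of the table.
total_steps = sum(steps for _, steps, _, _ in PROCESSES)
total_tpm_per_user = round(sum(tpm for _, _, tpm, _ in PROCESSES), 3)
print(total_steps, total_tpm_per_user)
```

Recomputing the Combined row this way is a quick consistency check before the figures are used to size a load scenario.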

Script Lines/Function Points

How many features/lines (or function points) should be/are in scripts created to conduct load test runs

Some development managers use the number of lines of code as a rough metric to measure the complexity of systems and the productivity (efficiency) of developers. This is controversial because better developers make use of reusable libraries and coding techniques to create more robust systems, which can take more time to create.

Test Run Length

How long should/do test runs take

The turnaround time of a run is determined by the ramp-up and run-length strategy, which are different for each type of test.

At the beginning of testing efforts, time to develop load testing scripts can lengthen test turnaround time.

This time is tracked by recording the timestamp when the notification of a configuration change is received and the timestamp when test results are published.

The amount of time needed for each test run should include the amount of time needed to manually prepare run conditions (such as running a program to reset data values) before each run as well as the amount of time to manually collect run logs and analyze run results.

Automation of run log collection can speed this up.

Number of Test Runs

How many runs should it / did it take

Organizations that rate themselves high on CMMI should have this figure as an outcome of planning efforts, with expected numbers coming from an analysis of metrics gathered from previous similar projects.

Observations

These are good candidates for "Six Sigma" improvement projects.

Several metrics that affect the performance and capacity of an application can be obtained even before load testing runs are completed.

These metrics need to be measured manually, with a stopwatch or maybe a calendar.

Environment Complexity

How quickly can configuration changes be made

We use a test log spiral notebook to record:

the timestamp of the email requesting a configuration change to be identified.
the timestamp of the email requesting a test run for a particular configuration change.
the time (work hours or days) calculated between these two times.
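
If the notebook entries are later typed into a file, the elapsed time can be calculated rather than worked out by hand. This is a minimal sketch, assuming timestamps are recorded as "YYYY-MM-DD HH:MM" text; the function name and format are hypothetical, and the result is wall-clock hours rather than work hours.

```python
from datetime import datetime


def elapsed_hours(requested_at, completed_at, fmt="%Y-%m-%d %H:%M"):
    """Return the wall-clock hours between two recorded timestamps."""
    start = datetime.strptime(requested_at, fmt)
    end = datetime.strptime(completed_at, fmt)
    if end < start:
        raise ValueError("completion timestamp precedes the request timestamp")
    return (end - start).total_seconds() / 3600


# Hypothetical log entries for one configuration change.
print(elapsed_hours("2024-03-04 09:15", "2024-03-04 11:45"))
```

The same calculation applies to the other stopwatch-style metrics in this section, such as restart time.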

Backup Imaging Time

How long does it take to backup disk images off servers

This affects the amount of time for testing.

Image Restore Time

How long does it take to restore disk images on servers

Recovery/Reboot Time

How long do servers take to restart

We use a test log spiral notebook to record:

the timestamp of when the command is issued on the first server
the timestamp of when "ready" is issued by the last server to start
the minutes calculated between these two times

Failover Time

How quickly can each component failover to redundant servers

This is determined during Failover testing.

Failback Time

How quickly can the system failback from redundant resources

This is determined during Failover testing.

Graphs and Dashboards


Placing several charts on the same ("dense") page allows interrelationships to be more easily identified and analyzed.
Performance and capacity measurement projects are not one-time projects, but an on-going endeavor.

Using Microsoft Excel is a two-edged sword. I prefer it because it is the most common package. I don't have to beg my employer to buy it to get my work done. Since some companies may not want to pay for a specialized package, I may be stuck using Excel anyway.

However, because of its power, Excel can be difficult to master. But I can show you how it can be done. After an initial investment of a few hours, you would develop an impressive skill that you'll take with you.


Analytics

"Visual analytics" apps work by reading data from ordinary Microsoft Excel spreadsheet files into files that PowerPoint, PDF, and Flash-enabled web pages use to enable interactive exploration of graphic data. They automatically switch graphic presentations in real-time response to variables specified by moving slider bars, accordion menus, and other "spiffy" user interfaces.

Packages from several vendors enable wider sources of data, such as XML from web services and direct connection to Oracle or SQL databases.

$395 Xcelsius from Infommersion, one of the Crystal Reports products by Business Objects.
Actuate
Bright Point

For consistency:

vertical

time frames

horizontal sliders or circular speedometers (like the iPod wheel) are used to specify various levels of load on the servers.
Slide to the lowest point for results associated with the minimum configuration.
Slide to the highest point for results pertaining to the largest configuration tested.

pull-down selectors are provided (instead of sliders) to specify non-continuous items such as departments.


Results from Each Measurement Run

The performance results displayed in the "Raw Speed" table below were collected from the start of script development efforts:

Imp. (Importance) provides the Importance of the dialog. The "MoSCoW" approach uses these designators:
- Must have (a mandatory requirement required for basic operations).
- Should have.
- Could have, if time permits.
- Won't have (restrictions on features which might pose security risks, etc.)
Manual Step describes the title of the page that should be returned after users take action.
Mix is the percentage of iterations in which the action is expected to participate.
Think Time is the amount of time experienced users typically need to perform the action. The standard times are:

Our scripts are coded so that statistics are captured for each action run with a single user.

Bytes 1 is the number of bytes received from each page.
Speed 1 is the number of milliseconds taken to respond to each page (1000 milliseconds equals 1 second). This number answers the question:

These numbers can potentially be used by load scripts to detect anomalies in responses during runs, such as issuing a message if fewer bytes are downloaded than expected for a particular page.
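
The byte-count check described above amounts to a comparison against the single-user baseline. The sketch below shows only the logic, in Python rather than in a load script's own language; the function name and the 10% tolerance are assumptions, while the page name and byte count are taken from the Raw Speed table.

```python
def check_page_bytes(page, received_bytes, expected_bytes, tolerance=0.10):
    """Return a warning message if fewer bytes arrived than expected, else None."""
    floor = expected_bytes * (1 - tolerance)
    if received_bytes < floor:
        return f"{page}: received {received_bytes} bytes, expected at least {floor:.0f}"
    return None


# Baseline of 43212 bytes captured for this page with a single user.
print(check_page_bytes("1_InvokeURL", 43212, 43212))
print(check_page_bytes("1_InvokeURL", 1200, 43212))
```

A short error page returned with a success status is the typical case this catches, since it would otherwise be timed as a fast, valid response.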


Raw Speed

What is the fastest system response time that users can expect from each dialog/screen/function of the application

The contents of this table are described in the section above. The statistics columns come from the LoadRunner Analysis Summary Report.

Imp.   BP   Manual step (Use Case)   Mix   Think Time   Trans. ID   Bytes 1   Speed 1   Min   Avg   Max   SD   CV

Must LL 1. Invoke homepage URL 90% -- 1_InvokeURL 43212 3212

High LL 1.2 Home on main menu for "Employee facing registry page" 40% -- 2_ 43212 3212

High LL 1.3 Logout 20% 2 9_ 43212 3212

High TS 3.1 Time sheet Menu link 40% 2 TS01 43212 3212

High TS 3.2 Lookup Dept 22% 6 TS02 43212 3212

High TS 3.3 Time sheet Entry Submit 38% 6 TS03 43212 3212

To better visualize the statistics, this bar chart ranks transactions. For each item:

The maximum (longest/slowest) time observed during the run is illustrated with a red bar.
The minimum (shortest/fastest) time observed during the run is illustrated with a blue bar.
The average time during the run is illustrated with a light-green bar.
The median time during the run is illustrated with a dark-green bar.

This graph should be generated for a run at a single pace (the same number of virtual users) throughout the run.

Impact of User Errors

How fast does the application detect, report, and recover from various user errors

Examples:

Application invocation with missing resources
For each type of user (user role)
- Registration
  - username already used
  - email address not supplied
  - inadequate password

To augment the Summary Report generated by LoadRunner's Analysis program, I copy and paste it onto an Excel spreadsheet, then

Highlight the screen/step with the highest variation by adding a "Coefficient of Variation" column calculated by dividing the Standard Deviation by the average. The larger the ratio, the greater the variation relative to the average.
Highlight screens/steps which have response times higher than a threshold of 2 seconds by adding a flag.
Merge numbers from the sheet with the table above for a consolidated presentation.
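
The two spreadsheet columns described above are simple to reproduce outside Excel. This is a minimal sketch, assuming the Summary Report rows have been exported with an average and a standard deviation per transaction; the row layout, function name, and sample values are hypothetical.

```python
def annotate(rows, threshold_s=2.0):
    """Add a coefficient of variation and a slow flag to each summary row."""
    annotated = []
    for row in rows:
        # Coefficient of variation: standard deviation relative to the average.
        cv = row["sd"] / row["avg"] if row["avg"] else float("nan")
        annotated.append({**row, "cv": cv, "slow": row["avg"] > threshold_s})
    return annotated


# Hypothetical averages and standard deviations, in seconds.
rows = [
    {"name": "TS01", "avg": 1.2, "sd": 0.3},
    {"name": "TS03", "avg": 3.0, "sd": 1.5},
]
result = annotate(rows)
most_variable = max(result, key=lambda r: r["cv"])["name"]
print(most_variable, [r["name"] for r in result if r["slow"]])
```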

Consistency of Response Time Speed

How frequently do spikes in response time occur

This line chart presents the results of a run at the same conditions over several hours.
Data values for these types of charts need to be presented at the lowest granularity (such as once per second, as shown here). Otherwise, individual spikes would be averaged in and thus not appear.
The mean time between failure (MTBF) statistic is calculated by dividing the number of spikes observed into the length of the observation period (such as 8 hours).
To analyze why, we drilled down to the small time frame specific to when the "blip" occurred on various servers.
Contention testing is often necessary to identify occasional spikes in response time.
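
The mean-time-between-spikes calculation described above can be sketched as follows, assuming response times sampled at a fixed interval and a threshold that defines a spike. Both the threshold and the sample data are hypothetical, and consecutive samples above the threshold are counted as a single spike.

```python
def mean_time_between_spikes(samples_s, spike_threshold_s, interval_s=1.0):
    """Return (spike count, observation period divided by spike count)."""
    spikes = 0
    in_spike = False
    for value in samples_s:
        above = value > spike_threshold_s
        if above and not in_spike:
            spikes += 1  # count each excursion once, however long it lasts
        in_spike = above
    period_s = len(samples_s) * interval_s
    return spikes, (period_s / spikes if spikes else None)


# Hypothetical 30 seconds of once-per-second samples containing two spikes.
samples = [1.0] * 10 + [5.0, 5.0] + [1.0] * 10 + [6.0] + [1.0] * 7
print(mean_time_between_spikes(samples, spike_threshold_s=4.0))
```

Averaging the samples into coarser buckets before running this would hide short spikes, which is why the data needs to stay at the lowest granularity.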

Speeds at Various Data Loads

How much degradation in response time can be expected as the application processes larger amounts of data

This question is answered with Data Volume Testing, when the maximum amount of data expected is loaded on the system so that its impact can be measured.

Volume testing is especially important to measure database performance because different size datasets require different indexing and caching strategies for maximum efficiency. Adding indexes to large datasets is the most common approach to improving performance from databases. On the other hand, indexing a small and frequently referenced dataset can actually slow processing speed.

Speeds at Various User Loads

How much degradation in response time can be expected as the application gets busier (processes more users/transactions simultaneously)

Two approaches to running Stress Tests were used to answer this question:

Gradually increase the load by increasing the number of simultaneous users until the server chokes. The results of this approach are shown on the first graph to the right.

"Stair-step" constant loads for a certain amount of time. The results of this approach are shown on the second graph.

Use of a third-party tool such as Excel to present information provides the freedom to use the Median rather than the Average presented in standard LoadRunner Analysis reports.

A more sophisticated (some may say overly complex) visualization is this "High-Avg-Low" chart (formatted using MS-Excel), which provides averages, medians, and variation statistics at each level of load (rather than combined together as with the first type of run).

Statistics from the first type of run are less useful because by default run averages include the spikes at the end. Data values can be filtered to the specific time period of interest, but "ramp-up" effects are still included at every point.

Results from stair-step type runs are closer to actual patterns of usage. More importantly, the stair-step approach provides information about the variability of response time at each step.
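
The per-step statistics behind a "High-Avg-Low" chart can be computed from raw data points once each point is tagged with the load level at which it was measured. This is a minimal sketch under that assumption; the function name and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def stats_by_load(samples):
    """samples: iterable of (virtual_users, response_time_s) pairs."""
    by_step = defaultdict(list)
    for users, seconds in samples:
        by_step[users].append(seconds)
    return {
        users: {
            "low": min(times),
            "median": median(times),
            "avg": mean(times),
            "high": max(times),
            "sd": pstdev(times),
        }
        for users, times in sorted(by_step.items())
    }


# Hypothetical data points from two steps of a stair-step run.
samples = [(10, 1.0), (10, 2.0), (10, 3.0), (20, 2.0), (20, 2.0), (20, 8.0)]
stats = stats_by_load(samples)
print(stats[20]["median"], stats[20]["avg"])
```

At the 20-user step the single 8-second response pulls the average up to 4 seconds while the median stays at 2, which illustrates why the median can be the more representative figure.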

Drop-down selections (or Forward and Backward buttons) are provided to see the impact from varying run conditions (such as different configurations, different versions of software, different installations of hardware, etc.).


Run Longevity

How long can the application run before needing manual intervention

Conducting a longevity run over 22 hours identified the average response times in this graph.

If there is time for only one run, this statistic should be obtained from a run at a high (but still sustainable) level of load.

The Variability statistic is measured using the standard deviation calculation.

The curiosity here is whether there is a statistically valid trend of responsiveness improving or degrading over time.
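
A first look at such a trend is the least-squares slope of response time against elapsed run time, sketched below with hypothetical hourly averages. The slope alone does not establish statistical validity; a significance test on the slope would still be needed, and that is not shown here.

```python
def trend_slope(times_h, response_s):
    """Least-squares slope: change in response time (seconds) per hour of run time."""
    n = len(times_h)
    if n < 2 or n != len(response_s):
        raise ValueError("need two or more paired observations")
    mean_x = sum(times_h) / n
    mean_y = sum(response_s) / n
    sxx = sum((x - mean_x) ** 2 for x in times_h)
    if sxx == 0:
        raise ValueError("all observations share the same time value")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(times_h, response_s))
    return sxy / sxx


# Hypothetical hourly averages that degrade by half a second per hour.
print(trend_slope([0, 1, 2, 3], [1.0, 1.5, 2.0, 2.5]))
```

A positive slope suggests degradation over time (for example from a resource leak), and a slope near zero suggests stable responsiveness.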

To analyze trends, we can use accordion menus to view consolidated and detailed views of specific time frames.

Data Transfer

How many bytes are sent back and forth between client and server?

Resource Consumption Dashboard

How many resources (servers, memory, disk space, connections, file handles, etc.) does the app consume

Mnemonic | Resource Metric | Tier 1: Load Balancer/Proxy | Tier 2: Web server (L23x1) | Tier 3: App server (abv123) | Tier 4: DB server (bruser1)
Brain | Memory | 1023 MB | 209 MB | 809 MB | 1209 MB
Arms | Disk Paging | 55 | 20 | 40 | 58
Legs | CPU Util | 55% | 20% | 40% | 58%
Feet | Network Bandwidth | 55 mbps | 20 mbps | 40 mbps | 58 mbps

The table here presents measurements horizontally, with one column for each system tier/machine.

Vertically, the different metrics are arranged according to portions of a running person, shown here as a mnemonic for the metrics.

Values are recorded during runs at peak capacity.

Typically, the highest utilization of a particular resource in one tier/machine is the bottleneck that limits the capacity of the entire system.
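
Picking out the candidate bottleneck from the dashboard is a matter of finding the tier with the highest utilization of a given resource. A minimal sketch, using the CPU utilization row from the table above; the function name is hypothetical.

```python
# CPU utilization (%) per tier, copied from the dashboard table.
CPU_UTIL = {
    "Load Balancer/Proxy": 55,
    "Web server": 20,
    "App server": 40,
    "DB server": 58,
}


def bottleneck(utilization_by_tier):
    """Return the tier with the highest utilization of one resource."""
    return max(utilization_by_tier, key=utilization_by_tier.get)


print(bottleneck(CPU_UTIL))
```

The same lookup would be repeated for each resource row, since the limiting tier for memory may differ from the limiting tier for CPU.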

With an interactive chart, one can drill down to the specific components of processes that are consuming resources by clicking on a bar.

Resource Consumption Alerts

At what level of resource consumption should operations be alerted for manual intervention

From a 1MB Powerpoint 2003 slideshow containing voice narration:

Capacity Metrics Analysis

Sample action conclusion/recommendation:
Additional capacity is needed by December 1st to meet new peak usage anticipated.
To meet this growth, we will need to begin work to add an additional __ servers on November 1st this year, a one month lead time.

Sample analysis:
This conclusion was calculated based on these findings:

At present, peak usage is 38 simultaneous users (on each of 4 app servers).

78 simultaneous users with no "think time" (on each of 4 app servers) is the point where performance degradation becomes noticeable at 4 seconds response time.

Our user base is growing at the rate of 30 "simultaneous users" per month (about 1 per day).

So we will use up the remaining headroom of 40 (78 - 38) simultaneous users in about 40 days, which is November 1st.

It takes a maximum of 10 days to order, install, configure, test, and integrate a new server.

So we need to begin the ordering process 10 days before December 1st.
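
The arithmetic behind this kind of analysis can be sketched as follows. The growth rate of 1 simultaneous user per day is an assumed value, chosen because it is the rate at which 40 users of headroom last 40 days; the function names and the starting date are hypothetical.

```python
import math
from datetime import date, timedelta


def days_until_degradation(current_peak, degradation_point, growth_per_day):
    """Days until peak usage grows from current_peak to degradation_point."""
    if growth_per_day <= 0:
        raise ValueError("growth_per_day must be positive")
    headroom = degradation_point - current_peak
    if headroom <= 0:
        return 0
    return math.ceil(headroom / growth_per_day)


def latest_order_date(today, days_left, lead_time_days):
    """Last date to start ordering so capacity arrives before degradation."""
    return today + timedelta(days=days_left - lead_time_days)


days_left = days_until_degradation(current_peak=38, degradation_point=78, growth_per_day=1.0)
print(days_left, latest_order_date(date(2024, 9, 22), days_left, lead_time_days=10))
```

Keeping the growth rate and lead time as parameters makes it easy to rerun the calculation when either estimate changes.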

Throughput (Capacity) Per Second vs. Response Time (Speed)

At what load does the application process at unacceptable response times (e.g., over x seconds)

This chart illustrates (rather typical) behavior.

The analysis:

At a moderate load of 35 simultaneous users processing 8.8 transactions per second, response time doubles to 0.8 seconds, but that is still acceptable performance.

At a high load (such as 100 simultaneous users in the above example taking response time to 8 seconds), the server completes transactions at about the same rate as when it had fewer transactions.

This indicates that the system has reached its point of "job flow balance", the TPS rate at which requests are processed as quickly as they are received. Requests arriving faster than this rate would be queued.

The Mercury Capacity Planning (MCP) Visualizer module displays similar data using this format, providing a pull-down menu to quickly access charts by individual business function:

Calculating Load and Five Considerations For Large Scale Systems

At what load does the system reject transactions

If the rate of requests is relentlessly beyond the job flow balance rate (such as beyond the persistent rate of 200 simultaneous users in the above example), eventually the server runs out of queue space. If it can't allocate less time to each user, it then returns errors or even shuts down.
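
How soon the queue space runs out can be estimated with a simplified model in which arrival and service rates are constant. This is only a sketch under that assumption; real servers also apply timeouts and throttling, and all the numbers below are hypothetical.

```python
def seconds_until_queue_full(arrival_tps, service_tps, queue_capacity, queued_now=0):
    """Return seconds until the queue overflows, or None if it never fills."""
    surplus = arrival_tps - service_tps  # requests added to the queue each second
    if surplus <= 0:
        return None  # at or below the job flow balance rate
    return (queue_capacity - queued_now) / surplus


# Hypothetical: 12 requests/second arriving at a server that completes 10 per second.
print(seconds_until_queue_full(arrival_tps=12.0, service_tps=10.0, queue_capacity=600))
```

Even a small sustained surplus over the job flow balance rate exhausts a fixed queue within minutes, which is why the point of rejection can follow closely after the point of degradation.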

Point of Overload Failure

At what overwhelming load (or other conditions) do servers shut down

This can include denial of service type attacks.

The Bottlenecks

How quickly do the servers/services restart on their own after being overloaded by a sudden overwhelming load

In production, if servers can restart automatically on their own, much time can be saved rather than assuming that system administrators have the diligence and the time to always watch the systems.

Servers which require another server (such as a proxy or database server) to be up before they start should be initialized with a process that checks the availability of those other servers and waits until they are available.
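
Such a startup check can be sketched as a polling loop around an availability test. This is a minimal illustration rather than a production script: the function names are hypothetical, a successful TCP connection is assumed to be an adequate sign that the dependency is up, and the host and port in the usage comment are placeholders.

```python
import socket
import time


def tcp_port_open(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def wait_for(check, timeout_s=300.0, interval_s=5.0, sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage (placeholder host and port):
#   if wait_for(lambda: tcp_port_open("db.example.internal", 1521)):
#       start the dependent server
```

Passing the check as a function keeps the loop reusable for other readiness tests, such as an HTTP health page or a database login.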

Comparison Regression Test Results

What is the impact of changes / tuning options (such as application software versions, utilities, OS settings, JVM settings, etc.)

This type of testing requires use of statistically valid methods to measure the likelihood that results occurred due to chance.

This approach requires the export of all "data points" (statistical "response variable") from LoadRunner to a statistics application which can generate a statistical presentation.

Each setting that can be changed (such as an individual server configuration setting) is statistically a single input "factor", also called a categorical variable or "treatment".

Runs (trials) at the low, medium, and high "levels" of a factor (such as server setting) are considered three statistical "groups" of data.

The calculation technique depends on the number of factors and groups:

A t test is used to determine whether there is a statistical difference between just two levels of a single treatment. This is also called "Student's t".

"One-way" ANOVA (Analysis of Variance) is used to calculate the statistical significance of differences in performance numbers after changing a single setting (such as an individual server configuration setting), which is statistically a single "treatment" or input "factor", with runs at 3 or more levels (such as low, medium, and high values).

The test for a statistical difference among groups is called an "F Test" (named after Ronald A. Fisher, who pioneered the analysis of variance during the 1920s and 1930s; the t test for comparing just two population means was published earlier by William S. Gosset under the name "Student"). The F test is based on the F ratio of the "variation" between groups over the variation within each group (considered as statistical "random error"). The larger the F ratio, the stronger the evidence of a real difference among the groups.

A specific F value becomes statistically significant when it is larger than the critical value defined in an "F Distribution", presented in a static (paper) "F Table" or dynamically calculated by a program such as the Statistical Distribution Calculator client. Critical F values are adjusted for the number of groups and number of observations for a given significance level (usually an "alpha" error of 0.05 or the stricter 0.01).

Source of Variation         SS    df    MS      F       p
between 3 groups            64     2    32
within 24 observations      68    21    3.24    9.88    <0.01
total                      132    23

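This table reflects the calculations needed to combine positive and negative differences (by squaring them) and to adjust for the statistical impact of a small number of observations (through the degrees of freedom).

Each "Mean Squared" (MS) value used in the F ratio calculation is a variance estimate. It is calculated by dividing the sum of squares (SS) (squared deviations about the mean, called the variation) by its degrees of freedom (df): the number of groups minus 1 for the between-groups row, and the number of observations minus the number of groups for the within-groups row. "p" is the probability of obtaining an F ratio at least this large if there were no real difference among the groups; the result is considered significant when p is smaller than the chosen "alpha".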
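The arithmetic behind such a table can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the procedure of any particular statistics package; the function name and the sample response times are hypothetical. It stops at the F ratio, since the p value requires the F distribution from a statistics application or table.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (ss_between, df_between, ss_within, df_within, f_ratio)."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # Variation of each group mean about the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation about its own group mean ("random error").
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ss_between, df_between, ss_within, df_within, ms_between / ms_within


# Hypothetical response times (seconds) at three levels of one setting.
low = [2.0, 3.0, 4.0]
medium = [4.0, 5.0, 6.0]
high = [6.0, 7.0, 8.0]
print(one_way_anova([low, medium, high]))  # (24.0, 2, 6.0, 6, 12.0)

# The F ratio in the table above follows from its SS and df columns.
print(round((64 / 2) / (68 / 21), 2))  # 9.88
```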

For more information on ANOVA:

Wikipedia
Statsoft
SigSigmaFirst.com describes calculation of results from 3 Methods for input into Excel's Data Analysis Add-in for Anova:Single Factor
Calculating Interactions using SPSS
http://www2.chass.ncsu.edu/garson/pa765/anova.htm
http://www.sportsci.org/resource/stats/ttest.html

"Two-way" ANOVA is used to examine the effects of two factors, both individually (each factor's "Main Effect") and together (their "Interaction"). [SAS code for 3-way]
Excel supports this with Data Analysis "Anova: Two-Factor With Replication" and "Anova: Two-Factor Without Replication"
Repeated Measures (paired t tests) for longitudinal studies.
Excel supports this with Data Analysis "t-Test: Paired Two Sample for Means".
Measurement of several changes between runs (such as changing both software version and hardware configuration) requires a multi-factor ANOVA to test several independent variables (factors). MANOVA (Multivariate Analysis of Variance) techniques apply when several dependent variables (such as response time and throughput) are analyzed together.

Different statistical applications display results from ANOVA as a "BoxPlot" or Box and Whisker plot arranged horizontally or vertically on the dependent variable (such as response time in seconds).

median

lower and upper quartile

Microsoft Excel users can use a "Volume-Open-High-Low-Close" chart format to approximate a BoxPlot/Box and Whisker Chart.
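The values a box plot draws can be computed directly from the exported data points. The sketch below assumes the "inclusive" quartile method of Python's statistics module; other tools use slightly different quartile conventions, so their boxes may differ a little. The function name and sample response times are hypothetical.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return the five values a box and whisker plot is drawn from."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "lower_quartile": q1,
        "median": q2,
        "upper_quartile": q3,
        "max": max(values),
    }


# Hypothetical response times in seconds from one test run.
times = [1.1, 1.3, 1.4, 1.6, 1.8, 2.0, 2.4, 3.1, 4.7]
print(box_plot_summary(times))
```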

Effect size is the difference between two groups stated as a percentage of a standard deviation (i.e., "14%"). It is the appropriate statistic for gauging the importance of comparisons. The guidelines are:

Load Balancing Test Results

How well do provisions for balancing load allocate transactions evenly among machines
Does one J2EE application server instance in a cluster perform more work than the others?
On which database instance are the most transactions executed?
Sample Conclusion: Multiple machines are needed to realistically create a large enough load to stress utility servers and services (such as to handle LDAP authentication, email, and database requests). Such servers are configured to serve many applications.
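The questions about evenness of allocation can be answered from per-instance transaction counts. The following is a rough sketch; the instance names, the counts, and the choice of "busiest instance divided by the average" as the imbalance measure are all assumptions made for illustration.

```python
def load_shares(transactions_by_instance):
    """Return each instance's fraction of all transactions."""
    total = sum(transactions_by_instance.values())
    return {name: count / total for name, count in transactions_by_instance.items()}


def imbalance_ratio(transactions_by_instance):
    """Busiest instance's count divided by the mean count (1.0 is perfectly even)."""
    counts = list(transactions_by_instance.values())
    return max(counts) / (sum(counts) / len(counts))


# Hypothetical transaction counts for three application server instances.
counts = {"app1": 5000, "app2": 3000, "app3": 2000}
print(load_shares(counts))
print(imbalance_ratio(counts))
```

With these sample counts, app1 handles half of all transactions, and the ratio of about 1.5 signals that it does roughly 50% more work than an evenly balanced instance would.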

Upgrade Regression Test Results

Can the system scale "vertically" when a server is upgraded with higher capacity components

This is a question of the cost effectiveness and stability of adding RAM or replacing components with faster ones (such as a faster I/O device or a motherboard with a larger number of processors/CPUs).

The conclusion is whether the application and utility software are programmed or configured to take advantage of the additional hardware.

Scalability Test Results

How well does the system "scale horizontally" when more machines are added
A 3 axis (3 dimensional) chart is necessary to illustrate the interrelationship of both 1) load and 2) number of machines on 3) response time.
The format of this 3D surface chart (created using Excel) is based on the one in Neil J. Gunther's book, presenting the performance response resulting from various numbers of "m" machines/processors running at various levels of load (load factors such as "0.80" for 80% CPU utilization).
This 3D surface chart was created by inputting the results of 25 separate runs into an MS-Excel spreadsheet.
1. When the machine operates with 2 m's (along the back line), performance is 4 seconds for a single user and 6 seconds for 300 users.
2. When the machine operates with 10 m's (along the front line), performance is 1.5 seconds for a single user and 2.5 seconds for 300 users.
3. In between these two extremes, such as when the machine operates with 6 m's, performance is under 3 seconds when servicing 100 users.
The conclusion from all this is that to maintain response times under 3 seconds, the choices are:
- Use machines with 8 m's and keep the load on any individual server under 280 users.
- Use machines with 6 m's and keep the load on any individual server under 80 users.
Furthermore, if 3 seconds is indeed the threshold:
- Machines with only 2 or 4 m's should not be used.
- A 10 m machine may be "overkill" if real loads do not reach a peak of 300 simultaneous users.
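The same selection can be made programmatically from the grid of run results. The sketch below is an illustration only: the 2 m and 10 m end points come from the description above, while every other value in the grid, along with the function name, is hypothetical filler rather than data from the 25 actual runs.

```python
def max_load_within_threshold(response_times, threshold_s):
    """For each machine size, find the highest tested load that stays under threshold_s.

    response_times maps number of m's -> {users: response time in seconds}.
    Loads are checked in ascending order and the search stops at the first
    load that breaches the threshold. None means no tested load qualified.
    """
    result = {}
    for machines, by_load in response_times.items():
        best = None
        for load in sorted(by_load):
            if by_load[load] >= threshold_s:
                break
            best = load
        result[machines] = best
    return result


# Partly hypothetical grid (see the note above).
grid = {
    2: {1: 4.0, 100: 5.0, 300: 6.0},
    6: {1: 2.2, 100: 2.9, 300: 3.8},
    10: {1: 1.5, 100: 2.0, 300: 2.5},
}
print(max_load_within_threshold(grid, 3.0))  # {2: None, 6: 100, 10: 300}
```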
Within MS-Excel, 3D chart's Elevation, Rotation, and Perspective can be adjusted using this dialog under Chart, 3D View...

A requirement related to this can be stated as "multiprocessor effect (ME) of no more than 75%", which means that there should be at least a 75% improvement after a 100% increase (doubling) of hardware.
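Checking such a requirement against two runs is a matter of simple arithmetic. The sketch below assumes that "improvement" is measured as the gain in throughput (tps) between a run before and a run after the hardware is doubled; the function names and sample figures are hypothetical.

```python
def improvement_pct(throughput_before, throughput_after):
    """Percentage gain in throughput from the first run to the second."""
    return (throughput_after - throughput_before) * 100.0 / throughput_before


def meets_requirement(throughput_before, throughput_after, minimum_gain_pct=75.0):
    """True when doubling the hardware delivered at least the required gain."""
    return improvement_pct(throughput_before, throughput_after) >= minimum_gain_pct


# Hypothetical: 200 tps before doubling the hardware, 360 tps after.
print(improvement_pct(200, 360))    # 80.0
print(meets_requirement(200, 360))  # True
```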

Reserve Capacity

How much longer can business volume grow before reaching a point of degradation
This type of testing is needed to provide assurance that scalability tests can be performed.
See Capacity Metrics Analysis section above.

Production Availability

What is the trend in response time over various ranges of time

How does our response time compare against our competitors over various ranges of time

Signal Processing of "noise".

Production Monitoring

Comparing results over time reveals how an application is performing versus how the application was expected to perform. Deviation from expectations should trigger an alarm message. The actual rate of user errors could impact the type of load on the system.
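A deviation check of this kind can be sketched as follows. The 20% tolerance, the message format, the function name, and the sample measurements are assumptions for illustration; the 1.3 second expectation echoes the login example used earlier on this page.

```python
def deviation_alarms(expected_s, observed, tolerance_pct=20.0):
    """Return an alarm message for each sample outside the tolerance band.

    observed is a list of (label, response_time_seconds) pairs.
    """
    alarms = []
    for label, seconds in observed:
        deviation_pct = (seconds - expected_s) / expected_s * 100.0
        if abs(deviation_pct) > tolerance_pct:
            alarms.append(
                f"ALARM {label}: {seconds:.2f}s is {deviation_pct:+.0f}% "
                f"from expected {expected_s:.2f}s"
            )
    return alarms


# Hypothetical samples taken at 15-minute intervals.
samples = [("09:00", 1.25), ("09:15", 1.40), ("09:30", 2.10)]
for message in deviation_alarms(1.3, samples):
    print(message)
```

Only the 09:30 sample falls outside the band here. Responses that are much faster than expected are flagged too, since they can indicate errors being returned instead of real work.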

Commercial Quality of Service Performance Management products for Network Operations Centers (NOC):

Keynote.com is a service that measures network uptime and performance.
AlertSite in Boca Raton, Florida monitors web sites and services.
ProactiveNet collects data from Business Service Management applications (such as IBM Tivoli, BMC Patrol, CA Unicenter TNG, etc.) and performs real-time analytics to pinpoint capacity choke points.

Performance Management

Foundations of Service Level Management (Indianapolis, Ind. Sams Publishing, 2000)

Daily alerts focus on up-time operational status. Kept online for the most recent two weeks are:

Outage report by application by location
Response time report by application by location summarized at 15-minute intervals for the prime shift, and at 30-minute intervals for the off-shift
Problem reports by priority, including a brief description of the problem for critical and severe problems
Average problem response time by priority
Problems closed and outstanding by priority
Security violations and attempted intrusions

Weekly volume reports focus on operational volumes. Kept online for the most recent eight weeks are:

Workload volumes by application summarized by shift by day
Outage summary by application by shift by day
Recovery analysis for all outages of significant duration
Cumulative outage duration for the month by application
Response time percentiles by application
Security violations and attempted intrusions

Monthly project reports focus on progress toward completing projects. Contents kept online for six months include:

Report card summary
Workload volumes by application
Service level achievement summary by application service
Highlighted problem areas and analysis

Quarterly trend reports focus on overall satisfaction, structural trends, and major initiatives. These include:

Workload trend report by application and user community
Customer satisfaction survey results
Service level achievement trends
Cost allocation summary
New IT initiatives


All trademarks and copyrights on this page are owned by their respective owners.
The rest ©Copyright 1996-2011 Wilson Mar. All rights reserved.