Cloud Services Performance

This page analyzes the work of performance testers on cloud computing platforms.

This is one in a series on Clouds and Windows servers:

 

Topics:

  • Why? The Need
  • The Trouble with Clouds
  • Metrics for Managers
  • Tech Metrics (CPU)
  • SAN/EBS Disk Perf
  • More Resources
  • Your comments???


    Do Cloud Environments Need Performance Testing?

      I once heard a used-Hummer salesman tell me, "in these tanks you can drive farther without adding gas."

      This was while we were standing next to the federally mandated MPG sticker publicizing the car's poor gas mileage.

      What the salesman doesn't point out is that the extra "range" comes only from a larger gas tank, not from better gas mileage, which results in a higher TCO (Total Cost of Ownership).

      We are hearing similar hyperbole now with the next big thing in IT -- clouds.



    Paradigm Shift

      Working within a cloud requires a paradigm change. As with any paradigm change, we need to give up some concepts and adopt new ways of thinking.

      For example, CPU Utilization is not totally irrelevant in a cloud, but it is no longer the main metric it used to be. Instead, CPU MHz Cycles used is key because it can be used for billing and capacity management. But it's captured only by VMware and other virtualization software.

      A machine with the same hardware specs and software may exhibit different performance characteristics due to the way data is stored and retrieved (from another server rather than on the same server's bus).

      The speed at which a server can be brought up from nothing is now a key metric, where earlier it was assumed that such an effort takes weeks.

      Another paradigm shift is going from benchmarking models assuming constant conditions to probabilistic models assuming constant change. More than ever, performance engineers need to become statisticians (I can't even spell the word ;). Well, actually, I have been a certified Quality Engineer for many years. But I still can't spell. ;)
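      The probabilistic shift can be made concrete in a few lines of code. This sketch (the demand samples and per-server throughput are made-up assumptions) sizes a fleet for a high percentile of observed demand rather than for the average of a steady-state benchmark:

```python
def servers_needed(samples_rps, per_server_rps, percentile=0.95):
    """Size the fleet for a high percentile of observed demand,
    not for the average of a steady-state benchmark."""
    ordered = sorted(samples_rps)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    peak = ordered[idx]
    return -(-peak // per_server_rps)  # ceiling division

# Hypothetical requests/sec sampled over a day, with one sharp spike:
demand = [120, 150, 180, 200, 950, 400, 260, 220, 210, 190, 170, 160]
print(servers_needed(demand, per_server_rps=100))  # 10 -- sized for the spike
```

      Note that sizing for the mean of those samples would suggest only 3 servers; the distribution, not the average, drives the answer.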

     

     

    The Trouble with Clouds

      Being able to mount any number of server instances is different from instant, magical increases in computing capacity for an unattended application.

    • When a new instance is started, the server needs to be configured and applications loaded. This can be a time-consuming, error-prone process to do, test, and automate.
    • Cloud providers still have relatively unsophisticated approaches to adding or tearing down servers (such as CPU above 70% for more than 10 minutes). This means manual monitoring or developing a smarter program that automatically analyzes prior traffic and creates instances before they are needed.
    • Load balancing requires an additional server instance to be set up (at additional cost) to receive all traffic and route it to an available slave server. This server becomes a single point of failure and needs to be monitored.

      GoGrid provides an F5 load balancer appliance for free as a competitive advantage. Unlike a regular server, the F5 appliance is purpose-built to efficiently handle load for many servers.

    • Cloud vendors may not offer the latest versions of software. Amazon EC2 currently offers images containing Windows 2003 while Microsoft Azure offers 2008 servers.
    • Cloud vendors may not offer adequate performance monitoring capabilities other than what is measured for billing.
    • GoGrid does not yet offer custom images. This adds time during which servers are costing money but are not available for actual work while they are being configured.
    • Some vendors (GoGrid) charge for a server even when it's not running.
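      The "smarter program" mentioned above can start as a simple trend projection: instead of reacting after CPU has already been high for ten minutes, fit the recent traffic slope and provision before projected demand crosses capacity. A minimal sketch, with made-up numbers and a naive linear projection (a real system would use a proper forecasting model):

```python
def should_preprovision(history_rps, capacity_rps, lookahead=3):
    """Project the recent linear trend `lookahead` intervals forward;
    return True if projected demand would exceed current capacity."""
    if len(history_rps) < 2:
        return False
    # Average change per interval over the recorded history.
    slope = (history_rps[-1] - history_rps[0]) / (len(history_rps) - 1)
    projected = history_rps[-1] + slope * lookahead
    return projected > capacity_rps

# Traffic climbing ~50 rps per interval toward a 450 rps capacity:
print(should_preprovision([200, 250, 300, 350], capacity_rps=450))  # True
```

      The point is to start a new instance while the curve is still rising, so its configuration lead time (discussed below) is hidden from users.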
     

     

    Metrics for Management

      So, based on the above, here are ways that executives can assess the operational responsiveness of their cloud services:

      1. How many servers need to be configured (including auxiliary servers for load balancing, etc.)?

        This is the Server Count metric. Amazon users should break down servers by size for each service (batch MapReduce machines separate from standard servers). As with load balancing, MapReduce requires a master which divides data for distribution to multiple slave instances (within a separate internal security group).

      2. What is the trigger to add or remove servers?

        This tipping-point metric is perhaps the trickiest question because this is an art, even with mountains of monitoring data and graphs showing historical trends. A simple rule to add servers, such as "when CPU ...", rarely suffices on its own.

      3. How much time does it take between the decision to add a server and it being fully functional?

        This is the Lead Time metric. This needs to include ALL the effort to get a server fully productive, not just the 15 minutes to provision the server so it can be configured.

      4. [Screenshot: Amazon EC2 Loading Instance]

      5. How do we determine the most appropriate size of server to provision?

        The decision to go with a larger server depends on how much effort it takes to configure each server and the cost of unused capacity, as well as the cost of running servers. This calculation should not be made purely on what vendors charge.

      6. What is the total cost of running the application?

        This is the cost per transaction bottom-line metric. Industry-standard benchmark applications and processes are needed to provide comparison. Existing benchmarks are focused on individual machines.

      7. How much data can be lost if a server goes down suddenly?

        This percentage vulnerability metric is part of a larger transaction robustness or availability analysis with disaster recovery scenarios.

      8. How much effort does it take to create and deploy an application release?

        This deployment efficiency metric is important because it occurs often. The need to go through a Remote Desktop may hinder productivity for some.

      9. How much time will it take to move to another cloud vendor?

        This organizational flexibility metric is important at a time when the largest automobile manufacturer and largest insurance company in the world can go bankrupt in the same year and entire cities can be crippled by terrorists within minutes.

        Organizations also should be ready to take advantage of the price war that is sure to occur among cloud vendors.

        Minimizing vendor lock-in is a serious consideration. The cloud hosting industry is new and consolidations are sure to occur. A vendor may make changes to its technologies, customer service, or pricing that make using them no longer attractive.

        Being in a position to make a move provides negotiating strength for users. It is far less frustrating than being in a "roach motel" or "Hotel California" situation (where you can check in but you can't check out).

        Even if you stay with the same vendor, the process of getting prepared to move also enables smoother and quicker migration from one version to the next.
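      The cost-per-transaction metric in item 6 above is simple arithmetic once the inputs are collected; the hard part is gathering them. A sketch with placeholder rates (none of these numbers are any vendor's actual pricing):

```python
def cost_per_transaction(server_hours, rate_per_hour,
                         data_gb, rate_per_gb, transactions):
    """Total infrastructure cost for the period divided by
    transactions served -- the bottom-line metric."""
    total = server_hours * rate_per_hour + data_gb * rate_per_gb
    return total / transactions

# Hypothetical month: 4 servers x 720 hours at $0.10/hour, plus
# 500 GB of data transfer at $0.15/GB, serving 2.88 million transactions.
print(round(cost_per_transaction(4 * 720, 0.10, 500, 0.15, 2_880_000), 6))
```

      Tracking this number per release also exposes regressions: a deploy that doubles CPU per request shows up directly in the denominator.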



    Technical Metrics - CPU Performance

      CPU Utilization is not totally irrelevant in a cloud, but it's not the main metric it is on bare-metal dedicated machines. Instead, CPU MHz Cycles used is now key because it is the basis for billing and capacity management. But it's captured only by VMware and other virtualization software (not by Microsoft's Perfmon or Linux sar or top).

     

     

    SAN (AWS EBS) Disk Performance

      Because EBS disks are shared, their speeds can vary greatly.
      "It feels a bit like trying to clock the speed of passing cars with a radar gun from the back of a rampaging bull."
      —heroku.com
      Most people make use of EBS to persist data, so this is an important part of the performance picture.

    • Measuring only end-to-end performance does not provide enough information to evaluate the speed offered by a cloud vendor, so lower-level measurements are necessary on an ongoing basis.

      heroku.com reports that "Under perfect circumstances a totally untweaked EBS drive running an ext3 filesystem will get you about 80kb read or write throughput and 7,000 seeks per second. Two disks in a RAID 0 configuration will get you about 140kb read or write and about 10,000 seeks per second."

      Anecdotal comments are that EBS performance maxes out at 20 to 30 disks, not at the 40 disk limit.

      A RAID chunk size of 256k seemed to be the sweet spot. This makes a (shockingly) HUGE difference in performance.

      Increasing the read-ahead buffer on the RAID to 64k (from 256 bytes) made a HUGE difference.

      The JFS filesystem (IBM's open-source journaled file system) and XFS improve performance more with EBS than the EXT3 default does on dedicated servers.

        BTW, some say JFS offers lower CPU usage and better performance and reliability than XFS, with EXT3 not handling large files as well. But it would be interesting if this benchmark were rerun within Amazon's cloud. Its conclusion is that "XFS appears as a good compromise, with relatively quick results, moderate usage of CPU and acceptable rate of page faults." However, other bloggers comment that "XFS does not handle crashes or power loss nicely." (The sequencing of journaling in XFS can result in data loss.)

        I haven't read much about VxFS, which is free with Veritas Storage Foundation Basic.

      An IO scheduler of deadline or cfq (but not noop) improved performance (though not as much as expected).

      Among the IO Benchmarks:

      • bonnie++ by Russell Coker (download) conducts simple tests of hard drive and file system performance. The program now has a -b option to cause a fsync() after every write (and a fsync() of the directory after file create or delete). Blocking writes is useful to test performance of mail or database servers. Random Seeks and Random Create are more critical performance metrics than Sequential Output, Sequential Input, or Sequential Create.
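      For a quick sanity check before a full bonnie++ run, the Random Seeks idea can be approximated in a few lines. A sketch (the scratch-file size and seek count are illustrative; a real test needs a file larger than RAM so the page cache doesn't hide the disk):

```python
import os
import random
import tempfile
import time

def random_seek_rate(path, file_size, block=512, seeks=2000):
    """Time `seeks` random single-block reads, roughly what bonnie++'s
    Random Seeks phase measures. Returns seeks per second."""
    with open(path, "rb") as f:
        start = time.perf_counter()
        for _ in range(seeks):
            f.seek(random.randrange(0, file_size - block))
            f.read(block)
        elapsed = time.perf_counter() - start
    return seeks / elapsed

# Exercise against a small scratch file:
size = 4 * 1024 * 1024
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(size))
print(random_seek_rate(tmp.name, size) > 0)
os.unlink(tmp.name)
```

      Run on a shared EBS volume, repeating this measurement over hours reveals the variability the heroku.com quote above complains about.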



    File Transfer (S3) Performance

      When does it make economic sense to serve files from S3 to all users versus serving them from a distributed set of servers?
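      One way to frame that question is as a break-even calculation: S3 charges mainly per GB transferred, while a server fleet is a roughly fixed monthly cost regardless of traffic. All prices in this sketch are made-up placeholders, not actual AWS rates:

```python
def cheaper_option(monthly_gb_served, s3_per_gb, fleet_monthly_cost):
    """Return which option costs less for a given month's traffic."""
    s3_cost = monthly_gb_served * s3_per_gb
    return "S3" if s3_cost < fleet_monthly_cost else "servers"

# With a placeholder $0.15/GB for S3 versus a $300/month fleet,
# the break-even point is 2000 GB/month:
print(cheaper_option(1500, 0.15, 300.0))  # S3
print(cheaper_option(2500, 0.15, 300.0))  # servers
```

      Below the break-even volume, the pay-per-GB model wins; above it, owned capacity wins, which is why the answer changes as an application grows.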

     

     

    Resources

     

       
     


    Visitor comments are owned by the poster.
    All trademarks and copyrights on this page are owned by their respective owners.
    The rest ©Copyright 1996-2011 Wilson Mar. All rights reserved.