Desktops for Everyone

February is Cloud Desktop Monitoring Month!

Talking about monitoring sometimes requires tools to keep you awake, like coffee and enormous speakers.

I’m happy to announce that February is Monitoring Month here at Desktops for Everyone! Monitoring and performance evaluation are a critical part of the user experience, and locking in baselines and acceptable performance levels is a core component of the Cloud Adoption Framework, as I wrote about here.

We’ll go through a series of posts guiding you through what the basics are, then tips and tricks from my years of experience with cloud-based desktops. Let’s not forget my NVIDIA #NGCA days - GPU workloads require a whole different set of monitoring elements! Lastly, we’ll cover non-cloud native providers in the market and what they bring to the table.

The schedule for the month will be:

  1. Intro & Monitoring Fundamentals

  2. Secondary Monitoring Practices/Resources, aka Tips and Tricks!

  3. GPU Monitoring

  4. Going Beyond Native Tools

Get off your phone - resource consumption is surging!

There are a number of common metrics that most admins are familiar with, plus a few CRITICAL metrics that are specific to cloud desktops.

CPU Consumption %

This typically references a point-in-time value or an average % consumption for the period indicated.

Sample alerting thresholds:

  • Warning: 75%-90% consumption for 5 consecutive minutes

  • Critical: 90%+ consumption for 5 consecutive minutes

A high CPU Consumption value indicates that users or applications are consuming a large amount of CPU resources. Alerting only after several consecutive minutes of high consumption keeps brief spikes from cluttering a ticketing or alerting system.
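As a rough sketch of the "several consecutive minutes" idea, assuming your tooling gives you one sample per minute: the function below only raises a level when every sample in the most recent window is at or above the threshold. The function name and its defaults are hypothetical illustrations built from the sample thresholds above, not part of any monitoring product.

```python
from typing import Sequence


def sustained_level(
    samples: Sequence[float],
    warning: float = 75.0,
    critical: float = 90.0,
    window: int = 5,
) -> str:
    """Classify the most recent `window` one-minute samples.

    Returns "critical", "warning", or "ok". A level is reported only
    when every sample in the window is at or above its threshold, so
    a single spike never triggers an alert on its own.
    """
    if len(samples) < window:
        return "ok"  # not enough history yet to call anything sustained
    recent = samples[-window:]
    if all(s >= critical for s in recent):
        return "critical"
    if all(s >= warning for s in recent):
        return "warning"
    return "ok"


# One spike to 95% in an otherwise quiet window is ignored.
print(sustained_level([50, 95, 60, 70, 65]))
# Five straight minutes between 75% and 90% is a warning.
print(sustained_level([80, 85, 88, 78, 82]))
```

The same function works for Memory Consumption % since the sample thresholds are identical; other metrics would need their own `warning` and `critical` values.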

Resolving High CPU Consumption %: First, consider widening the time range of the data displayed, if your tool allows it. This can give you a sense of when the issue originally began and whether there is a recurring pattern. Next, attempt to identify which specific users and processes are consuming CPU resources. Finally, communicate with the user(s) directly and/or connect to the machine as an admin to troubleshoot further.

Memory Consumption %

This typically references a point-in-time value or an average % consumption for the period indicated.

Sample alerting thresholds:

  • Warning: 75%-90% consumption for 5 consecutive minutes

  • Critical: 90%+ consumption for 5 consecutive minutes

A high value indicates that users or applications are consuming a large amount of Memory (RAM) resources. Alerting only after several consecutive minutes of high consumption keeps brief spikes from cluttering a ticketing or alerting system.

Resolving High Memory Consumption %: First, consider widening the time range of the data displayed, if your tool allows it. This can give you a sense of when the issue originally began and whether there is a recurring pattern. Next, attempt to identify which specific users and processes are consuming RAM resources. Finally, communicate with the user(s) directly and/or connect to the machine as an admin to troubleshoot further.

Processor Queue Length

This typically references a point-in-time value or an average across the period indicated.

Sample alerting thresholds:

  • Warning: Greater than 6 for 5 consecutive minutes

  • Critical: Greater than 10 for 5 consecutive minutes

A high value indicates that a number of threads are lined up waiting for execution by the CPU. End users may report hanging or frozen applications.

Resolving High Processor Queue Length: First, consider widening the time range of the data displayed, if your tool allows it. This can give you a sense of when the issue originally began and whether there is a recurring pattern. If the issue is consistent, consider adding CPU to relieve this bottleneck, even if CPU consumption % isn't high, because per Microsoft guidance there is only one queue for all CPUs. Generally speaking, a value under 10 (or a value under 2 * CPU QTY, so 16 for an 8 CPU VM) is fine for a short period of time.
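The 2 * CPU QTY rule of thumb is easy to turn into a quick check. This is a minimal sketch; the function names are hypothetical, and the multiplier of 2 is simply the figure quoted above, so adjust it for your own workloads.

```python
def queue_length_limit(cpu_count: int, per_cpu: int = 2) -> int:
    """Rule-of-thumb ceiling for Processor Queue Length on one VM."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return per_cpu * cpu_count


def queue_is_acceptable(queue_length: float, cpu_count: int) -> bool:
    """True when the queue is under the rule-of-thumb ceiling."""
    return queue_length < queue_length_limit(cpu_count)


print(queue_length_limit(8))       # 16, matching the 8 CPU example
print(queue_is_acceptable(12, 8))  # True: under 16
print(queue_is_acceptable(12, 4))  # False: the ceiling for 4 CPUs is 8
```

Scaling the ceiling by CPU count matters because the queue is shared: the same raw value that is harmless on a large VM can point to a real bottleneck on a small one.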

Page File Used %

This typically references a point-in-time value or an average across the period indicated.

Sample alerting thresholds:

  • Warning: 50%-75% consumption for 5 consecutive minutes

  • Critical: 75%+ consumption for 5 consecutive minutes

A high value indicates that one or more applications are consuming too much RAM. The memory manager is likely moving pages out of RAM and into the page file, which is what shows up in this metric.

Resolving High Page File Used %: First, consider widening the time range of the data displayed, if your tool allows it. This can give you a sense of when the issue originally began and whether there is a recurring pattern. Next, attempt to identify which specific users and processes are consuming RAM resources. Finally, communicate with the user(s) directly and/or connect to the machine as an admin to troubleshoot further.

Here are the grand finales of this post, friends…

User Input Delay

User Input Delay is one of the newest perfmon counters (although that isn’t saying much) that is relevant for cloud desktops. It indicates the average amount of time it takes for a user’s action to take effect and be shown to them in their virtual desktop for the period indicated. This is often reflected in how quickly the user sees the text they type appear in an email or in a Word document.

Sample alerting thresholds:

  • Warning: 150+ milliseconds for 5 consecutive minutes

  • Critical: 200+ milliseconds for 5 consecutive minutes

A high value indicates reduced performance for end users – this may result in “it feels slow” complaints or “I can’t work” reports in extreme scenarios.

Resolving High User Input Delay: First, consider widening the time range of the data displayed, if your tool allows it. This can give you a sense of when the issue originally began and whether there is a recurring pattern. This metric is not related to network latency; it is related to machine performance. Next, navigate to the Performance tab to see which specific users and processes are consuming CPU and/or RAM resources. If this does not yield anything to troubleshoot, look to the following articles to see where else you can dig deep to resolve user challenges. :)

Finally, communicate with the user(s) directly and/or connect to the machine as an admin to troubleshoot further.

Round Trip Time

Indicates the average connection quality of the user’s session (round-trip milliseconds between their physical location and their virtual desktop) for the period indicated.

Sample alerting thresholds:

  • Warning: 150+ milliseconds for 5 consecutive minutes

  • Critical: 200+ milliseconds for 5 consecutive minutes

A high value indicates reduced connection quality between the user’s physical location and the location of their virtual desktop. This is an ongoing measurement, which makes it more relevant than a user’s login time, since login time is a point-in-time metric. This may manifest in “it feels slow” complaints or user sessions spontaneously disconnecting.

Resolving High Round Trip Time: First, consider widening the time range of the data displayed, if your tool allows it. This can give you a sense of when the issue originally began and whether there is a recurring pattern. If the issue is consistent, consider adding Windows 365 provisioning policies and/or AVD host pools in locations closer to the affected users to resolve geographic-based challenges. Once you’ve located resources as close to users as possible, any root cause resolutions for these issues are ultimately up to the ISPs responsible for the networks and the hops in between those networks. Tools like traceroute or PingPlotter can help identify where the challenges lie in the connection at a point in time.
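To pull the sample thresholds from this post into one place, here is a small sketch of a lookup table with a point-in-time classifier. The metric keys, the `Threshold` type, and `classify` are hypothetical names chosen for illustration; the numbers are the sample thresholds listed in each section. It deliberately leaves out the "5 consecutive minutes" condition, which you would layer on top.

```python
from typing import NamedTuple


class Threshold(NamedTuple):
    warning: float
    critical: float
    # True for "150+" style limits, False for "greater than 6" style limits.
    inclusive: bool = True


# Sample thresholds from this post; tune them to your own baselines.
SAMPLE_THRESHOLDS = {
    "cpu_percent": Threshold(75, 90),
    "memory_percent": Threshold(75, 90),
    "processor_queue_length": Threshold(6, 10, inclusive=False),
    "page_file_used_percent": Threshold(50, 75),
    "user_input_delay_ms": Threshold(150, 200),
    "round_trip_time_ms": Threshold(150, 200),
}


def classify(metric: str, value: float) -> str:
    """Return "ok", "warning", or "critical" for a single reading."""
    t = SAMPLE_THRESHOLDS[metric]

    def reached(limit: float) -> bool:
        return value >= limit if t.inclusive else value > limit

    if reached(t.critical):
        return "critical"
    if reached(t.warning):
        return "warning"
    return "ok"


print(classify("round_trip_time_ms", 180))
print(classify("processor_queue_length", 6))
```

Keeping the thresholds in data rather than scattered through code makes it easier to adjust them once you have locked in your own baselines.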

I’m looking forward to digging in deeper next week!