WINDOWS PERFORMANCE MONITORING TIPS WITH SPLUNK

At Octamis we love Splunk, and we love to share our knowledge and experience, so let’s study some tips on Windows monitoring with Splunk !

PREPARING YOUR SPLUNK

Let’s proceed in the order, we want first to get Splunk ready to receive Windows performance data.

This is quite simple and relies on deploying the Windows technical add-on built by Splunk:

https://splunkbase.splunk.com/app/742/

Depending on your Splunk architecture, ensure to deploy the technical add-on everywhere it is required:

  • Indexers (clustered or standalone)
  • Search heads
  • Intermediate forwarder on the path to your indexers, if any

The add-on has no inputs activated by default, at this point there are no modification required.

Indexes creation:

The Windows technical add-on contains the embedded definition for a few indexes:

  • perform: dedicated index for Windows monitoring data using the perfmon
  • windows: dedicated index for various log data and monitoring not related to perfmon data
  • wineventlog: security and log data

If you are running a standalone server then you have nothing else to do as the indexes have created for you by the add-on.

If  you are running Splunk with clustered indexers, be sure to declare those indexes properly before continuing the setup.

DEPLOYING AND CONFIGURING THE ADD-ON

For the demonstration purpose, we will assume that:

  • You already have servers running with the Splunk Universal Forwarder
  • The servers are connected to your Splunk indexer(s) and properly configured for Splunk indexing
  • The servers are connected to a Splunk deployment server (recommended) or you use your deployment solution

Deploy the technical add-on as usual and continue the setup.

COLLECTING PERFORMANCE DATA

Let’s have a look at the default “inputs.conf” file provided within the technical add-on, since we focus on performance metric, we are only interested for now in the “perfmon” stanzas.

For the the demonstration purposes, let’s have a look at CPU and memory metrics:

Splunk_TA_windows/default/inputs.conf

## CPU
[perfmon://CPU]
counters = % Processor Time; % User Time; % Privileged Time; Interrupts/sec; % DPC Time; % Interrupt Time; DPCs Queued/sec; DPC Rate; % Idle Time; % C1 Time; % C2 Time; % C3 Time; C1 Transitions/sec; C2 Transitions/sec; C3 Transitions/sec
disabled = 1
instances = *
interval = 10
object = Processor
useEnglishOnly=true
index = perfmon

## Memory
[perfmon://Memory]
counters = Page Faults/sec; Available Bytes; Committed Bytes; Commit Limit; Write Copies/sec; Transition Faults/sec; Cache Faults/sec; Demand Zero Faults/sec; Pages/sec; Pages Input/sec; Page Reads/sec; Pages Output/sec; Pool Paged Bytes; Pool Nonpaged Bytes; Page Writes/sec; Pool Paged Allocs; Pool Nonpaged Allocs; Free System Page Table Entries; Cache Bytes; Cache Bytes Peak; Pool Paged Resident Bytes; System Code Total Bytes; System Code Resident Bytes; System Driver Total Bytes; System Driver Resident Bytes; System Cache Resident Bytes; % Committed Bytes In Use; Available KBytes; Available MBytes; Transition Pages RePurposed/sec; Free & Zero Page List Bytes; Modified Page List Bytes; Standby Cache Reserve Bytes; Standby Cache Normal Priority Bytes; Standby Cache Core Bytes; Long-Term Average Standby Cache Lifetime (s)
disabled = 1
interval = 10
object = Memory
useEnglishOnly=true
index = perfmon

Things you will (should) probably want to customise:

  • “interval” : this is the time in seconds between 2 performance collections, and will influence the volume of data to be generated. 10 seconds is probably quite high, 30 or 60 seconds are good values that save license, bandwidth and CPU footprint on the servers
  • “mode = multikv” : this is a great option introduced years ago (see: https://www.splunk.com/blog/2013/10/28/new-features-for-perfmon-in-splunk-6), this is smart, it saves license, storage and bandwidth
  • “disabled = 1”: This deactivates the input which is the case by default but you need to explicitly activate each input

Let’s with the following configuration, as always do never modify a default file, create a local file and copy only the stanzas you are interested in:

Splunk_TA_windows/local/inputs.conf

## CPU
[perfmon://CPU]
counters = % Processor Time; % User Time; % Privileged Time; Interrupts/sec; % DPC Time; % Interrupt Time; DPCs Queued/sec; DPC Rate; % Idle Time; % C1 Time; % C2 Time; % C3 Time; C1 Transitions/sec; C2 Transitions/sec; C3 Transitions/sec
disabled = 0
instances = *
interval = 30
object = Processor
useEnglishOnly=true
index = perfmon
mode = multikv

## Memory
[perfmon://Memory]
counters = Page Faults/sec; Available Bytes; Committed Bytes; Commit Limit; Write Copies/sec; Transition Faults/sec; Cache Faults/sec; Demand Zero Faults/sec; Pages/sec; Pages Input/sec; Page Reads/sec; Pages Output/sec; Pool Paged Bytes; Pool Nonpaged Bytes; Page Writes/sec; Pool Paged Allocs; Pool Nonpaged Allocs; Free System Page Table Entries; Cache Bytes; Cache Bytes Peak; Pool Paged Resident Bytes; System Code Total Bytes; System Code Resident Bytes; System Driver Total Bytes; System Driver Resident Bytes; System Cache Resident Bytes; % Committed Bytes In Use; Available KBytes; Available MBytes; Transition Pages RePurposed/sec; Free & Zero Page List Bytes; Modified Page List Bytes; Standby Cache Reserve Bytes; Standby Cache Normal Priority Bytes; Standby Cache Core Bytes; Long-Term Average Standby Cache Lifetime (s)
disabled = 0
interval = 30
object = Memory
useEnglishOnly=true
index = perfmon
mode = multikv

Deploy this configuration to your Windows servers, and if you use Splunk deployment server, ensure you check “restart splunkd”.

CHECKING DATA COMING IN

Next step, let’s check for some data coming in:

index=perfmon

Depending on the mode (multikv or not), the data will be available in:

CPU statistics:

  • standard mode: index=perfmon sourcetype=”Perfmon:cpu”
  • multikv mode: index=perfmon sourcetype=”PerfmonMk:cpu”

Memory statistics:

  • index=perfmon sourcetype=”Perfmon:Memory”
  • index=perfmon sourcetype=”PerfmonMk:Memory”

In this article, we will go will the multikv mode.

ANALYSING CPU STATISTICS

Let’s get some CPU statistics:

Per host average CPU usage over time: 

index=perfmon sourcetype="PerfmonMk:CPU" instance=_Total
| timechart avg(%_Processor_Time) as cpu_usage by host

Very simple.

WHERE IS MY PERCENTAGE OF MEMORY UTILISATION ?

If you are “like me”, when looking at memory statistics, the first (and potentially the only) metric you want to be able to retrieve is the percentage of memory being used, or eventually memory free.

So what’s the problem then ? Well, “as it” although we have dozens of various metrics, the percentage of utilisation is not available with perfmon data.

What ???

Hopefully, we can calculate it ! Using Splunk power and features, we can correlate between the inventory data which contains the amount of physical memory available, and the memory metrics available in perfmon.

The following search reports the amount of physical memory in KB:

index=windows sourcetype=WinHostMon
| stats latest(TotalPhysicalMemoryKB) as TotalPhysicalMemoryKB, latest(TotalVirtualMemoryKB) as TotalVirtualMemoryKB by host | sort 0 host

Notes:

This requires the input “OperatingSystem” to be activated in your deployment, using:

[WinHostMon://OperatingSystem]
interval = 600
disabled = 1
type = OperatingSystem
index = windows

For the demonstration, let’s store this result in a temporarily lookup csv file:

index=windows sourcetype=WinHostMon
| stats latest(TotalPhysicalMemoryKB) as TotalPhysicalMemoryKB, latest(TotalVirtualMemoryKB) as TotalVirtualMemoryKB by host | sort 0 host
| outputlookup windows_memory_inventory.csv

Then, looking at the memory statistics, we have the amount of currently used volume of memory in KB, let’s map this with the inventory data and use some easy calculation:

index=perfmon sourcetype="PerfmonMk:Memory"
| eval used_memory_KB=coalesce('Available_KBytes', Value)
| lookup windows_memory_inventory.csv host as host OUTPUTNEW TotalPhysicalMemoryKB
| eval free_memory_pct=((used_memory_KB/TotalPhysicalMemoryKB)*100), used_memory_pct=(100-free_memory_pct)
| timechart avg(used_memory_pct) as used_memory_pct by host

There you go!

Resilient solution:

  • create a KVstore based lookup table to store our Windows configuration inventory data
  • schedule a report to update the lookup table on a regular basis (per day basis for example)
  • create an auto lookup configuration such that it is not necessary to perform the lookup command manually

WHAT ABOUT PROCESSES ?

Understanding a system CPU load requires knowing what and when the processes consumes resources, the perfmon provides processes related data with the “[perfmon://Process]” stanza.

However, for some reasons the perfmon data is not accurate on multi core systems, a nice article gave me the answer I was looking for:

Windows CPU monitoring with Splunk

Based on this great article, let’s add our WMI input to generate accurate processes CPU statistics: (caution: this is a “wmi.conf” and not “inputs.conf”)

Splunk_TA_windows/local/wmi.conf

[WMI:process]
index = windows
disabled = 0
interval = 30
wql = Select IDProcess,Name,PercentProcessorTime,TimeStamp_Sys100NS from Win32_PerfRawData_PerfProc_Process

Once deployed, let’s use some magic searches and start analysing processes activity:

index=windows sourcetype="WMI:process" Name!=_Total Name!=Idle
| reverse | streamstats current=f last(PercentProcessorTime) as last_PercentProcessorTime last(Timestamp_Sys100NS) as last_Timestamp_Sys100NS by Name
| eval cputime = 100 * (PercentProcessorTime - last_PercentProcessorTime) / (Timestamp_Sys100NS - last_Timestamp_Sys100NS)
| search cputime > 0
| timechart limit=50 useother=f avg(cputime) by Name

Since Windows will create a new process for a given program able to run in multi core mode, we can improve this search and aggregate a per command invocation basis:

index=windows sourcetype="WMI:process" Name!=_Total Name!=Idle
| reverse | streamstats current=f last(PercentProcessorTime) as last_PercentProcessorTime last(Timestamp_Sys100NS) as last_Timestamp_Sys100NS by Name
| eval cputime = 100 * (PercentProcessorTime - last_PercentProcessorTime) / (Timestamp_Sys100NS - last_Timestamp_Sys100NS)
| search cputime > 0
| stats avg(cputime) as cputime by _time,host,Name
| rex field=Name "(?[^#]*)#{0,}"
| stats sum(cputime) as cputime by _time,host,Command
| timechart limit=50 useother=f avg(cputime) as cputime by Command

 

Et voila !

You now have all the main pieces of work to start analysing Windows performance with accuracy, enjoy.

 

WINDOWS PERFORMANCE MONITORING TIPS WITH SPLUNK

5 thoughts on “WINDOWS PERFORMANCE MONITORING TIPS WITH SPLUNK

  1. Hi Guilhem,

    I’ve recently come across this blog and wanted to let you know I’ve found it extremely helpful for process monitoring in my environment. I did come across a small problem that I’m hoping you can assist with. My environment has a few process with spaces in the executable name. The first word in each of these executables is the same and appears to display correctly within WMIC, but once the data is passed to Splunk the raw data is consolidated into just the first word for ‘Name’. Do you know how I could fix this limitation?

    Regards,

    Adam

  2. Answered my own question. For anyone else who has experienced this problem, update props.conf on the Indexer. This overrides the built-in regex for WMI and uses a new regex that breaks on carriage return.

    [WMI:process]
    EXTRACT-e2= Name.(?.*?)[\r\n]

  3. Hi Adam,

    Glad this post was useful to you and you could answer your own question that quickly, well done 😉

    Guilhem

  4. Hi Guilhem! Fantastic article, saved me frustration setting up some performance monitoring.

    Seems the rex in the last example is missing a chunk in the code that is shared.
    | rex field=Name “(?[^#]*)#{0,}”

    needs to be:
    | rex field=Name “(?[^#]*)#{0,}”

    1. Hi Kevin!

      Glad this article was useful to you, regarding your comment about the rex I do not see a difference, was something wrong in it ?

      Guilhem

Leave a Reply

Your email address will not be published. Required fields are marked *

Scroll to top