HP-AMS on hardware older than the DL380 Gen8

HP-AMS keeps restarting

Problem

A few weeks ago we started to deploy the HP Custom Image for ESXi 5.1.0 Update 2 on all our ESXi hosts. Everything seemed to work without problems until a colleague recently discovered in the log files that the HP-AMS provider keeps restarting every 5 minutes and logs an error that it cannot start because it only works on Gen8 hardware.

We also noticed the problem only occurred on ESXi hosts that are not HP DL380 Gen8 machines; the DL585 G5, G6 and G7 hosts all logged these errors. That makes sense, since the error states that the agent only runs on Gen8 hardware and newer.

Solution

Luckily I found a VMware KB article, KB2085618, which described our problem.

Too bad the only solution is to remove the Agentless Management agent…by hand on the command line on 50+ ESXi hosts.

Damn! I was too lame to do this by hand, so I built a little PowerCLI script. It isn't finished or error-free yet; it was just a quick and dirty solution for fast results. Still, I'd like to share it already, because it is faster than enabling SSH everywhere, connecting to each ESXi host, typing the commands, rebooting, and so on.

Script

Prerequisites

– Connect to your vCenter

– Put the host in maintenance mode

– Load the function module (dot-source the functions below)

– Plink installed, and the script edited to point to your Plink directory

Download Plink here:

http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
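For completeness, here is a minimal PowerCLI sketch of those prerequisites (the vCenter and host names are examples, and the function file name is hypothetical; the Plink path is whatever you configure in the script):

Connect-VIServer vcenter01.local                                  # connect to your vCenter
Set-VMHost -VMHost (Get-VMHost esx1.local) -State Maintenance     # put the host in maintenance mode
. .\HP-AMS-Functions.ps1                                          # dot-source the functions below (example file name)
Test-Path "D:\Putty\plink.exe"                                    # verify Plink is where the script expects it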

 


Function Enable-TSM {
    Param (
        [parameter(valuefrompipeline = $true, mandatory = $true,
        HelpMessage = "Enter an ESX(i) entity")]
        [PSObject]$VMHost,
        [switch]$Local)

    process {
        switch ($VMHost.gettype().name) {
            "String" {
                if ($Local) {$VMHost = Get-VMHost -Name $VMHost | Enable-TSM -Local}
                else {$VMHost = Get-VMHost -Name $VMHost | Enable-TSM}
            }
            "VMHostImpl" {
                if ($Local) {
                    $VMHost | Get-VMHostService | Where {$_.Key -eq "TSM"} | %{
                        if ($_.running -eq $false) {
                            $_ | Start-VMHostService -Confirm:$false | Out-Null
                            Write-Host "$($_.Label) on $VMHost started"
                        }
                        else {Write-Warning "$($_.Label) on $VMHost already started"}
                    }
                }
                else {
                    $VMHost | Get-VMHostService | Where {$_.Key -eq "TSM-SSH"} | %{
                        if ($_.running -eq $false) {
                            $_ | Start-VMHostService -Confirm:$false | Out-Null
                            Write-Host "$($_.Label) on $VMHost started"
                        }
                        else {Write-Warning "$($_.Label) on $VMHost already started"}
                    }
                }
            }
            default {throw "No valid type for parameter -VMHost specified"}
        }
    }
}


Function Disable-TSM {
    Param (
        [parameter(valuefrompipeline = $true, mandatory = $true,
        HelpMessage = "Enter an ESX(i) entity")]
        [PSObject]$VMHost,
        [switch]$Local)

    process {
        switch ($VMHost.gettype().name) {
            "String" {
                if ($Local) {$VMHost = Get-VMHost -Name $VMHost | Disable-TSM -Local}
                else {$VMHost = Get-VMHost -Name $VMHost | Disable-TSM}
            }
            "VMHostImpl" {
                if ($Local) {
                    $VMHost | Get-VMHostService | Where {$_.Key -eq "TSM"} | %{
                        if ($_.running -eq $true) {
                            $_ | Stop-VMHostService -Confirm:$false | Out-Null
                            Write-Host "$($_.Label) on $VMHost stopped"
                        }
                        else {Write-Warning "$($_.Label) on $VMHost already stopped"}
                    }
                }
                else {
                    $VMHost | Get-VMHostService | Where {$_.Key -eq "TSM-SSH"} | %{
                        if ($_.running -eq $true) {
                            $_ | Stop-VMHostService -Confirm:$false | Out-Null
                            Write-Host "$($_.Label) on $VMHost stopped"
                        }
                        else {Write-Warning "$($_.Label) on $VMHost already stopped"}
                    }
                }
            }
            default {throw "No valid type for parameter -VMHost specified"}
        }
    }
}

These functions were still in my PowerShell profile, so I included them here, but they were not written by me; they are only needed to enable and disable SSH.


function Get-HP {
<#
    # Help file
#>
    [CmdletBinding()]
    param(
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [System.String]
        $VMhostName,

        [switch]$Status,

        [switch]$Remove
    )
    try {
        $Hosts = Get-VMHost $VMhostName
        if ($Status -eq $true) {
            ####### Check HP-AMS provider status #######
            foreach ($VMHost in $Hosts) {
                $ESXCLI = Get-EsxCli -VMHost $VMHost
                $HP = $ESXCLI.software.vib.list() | Where { $_.Name -like "hp-ams" } | Select @{N="VMHost";E={$ESXCLI.VMHost}}, Name, Version
                if ($HP.Name -eq "hp-ams") {
                    if ($Hosts.Model -match "Gen8") {
                        Write-Host -fore Green "HP-AMS Provider found on" $HP.VMhost $Hosts.Model "Version:" $HP.Version
                    }
                    else {
                        Write-Host -fore Red "Please remove HP-AMS Provider found on" $HP.VMhost $Hosts.Model "Version:" $HP.Version
                    }
                }
                else {
                    Write-Host -ForegroundColor Red "No HP-AMS Provider found on $($HP.VMhost) $($Hosts.Model)"
                }
            }
        }
        elseif ($Remove -eq $true) {
            ####### Remove option #######
            #1 Maintenance mode check
            Write-Host "Checking maintenance mode"
            if ((Get-VMHost $Hosts | Select ConnectionState).ConnectionState -ne "Maintenance")
                {throw "Put the host in maintenance mode first"}
            else {
                Write-Host -ForegroundColor Green "Maintenance mode OK"
                #2 Enable SSH
                Enable-TSM $Hosts
                if ((Get-VMHostService -VMHost $Hosts | ?{$_.Key -eq "TSM-SSH"}).Running -eq "True")
                    {Write-Host -ForegroundColor Green "SSH started successfully"}
                else
                    {Write-Host -ForegroundColor Red "SSH failed to start"}

                #3 Stop the HP-AMS service via a Plink call
                # Create an alias for plink and test the path
                if (-not (Test-Path "D:\Putty\plink.EXE")) {throw "D:\Putty\plink.EXE needed"}
                Set-Alias plink "D:\Putty\plink.EXE"
                $Str1 = 'echo Y | plink -pw Password -l root '
                $Stop = ' /etc/init.d/hp-ams.sh stop'
                $Server = $Hosts.Name
                $command = $Str1 + $Server + $Stop
                $output = Invoke-Expression -Command $command
                $output

                #4 Remove the HP-AMS VIB
                Write-Host "Starting removal"
                $Str2 = 'plink -pw Password -l root '
                $Remover = ' esxcli software vib remove -n hp-ams'
                $command = $Str2 + $Server + $Remover
                $output1 = Invoke-Expression -Command $command
                $output1
                if ($output1 -like "*successfully*") {
                    Write-Host -ForegroundColor Green "Removal completed successfully"
                    if ($output1 -like "*reboot*") {
                        Write-Host -ForegroundColor Yellow "Reboot required and starting now"
                        Restart-VMHost -VMHost $Hosts -Confirm:$false | Out-Null
                        Write-Host -ForegroundColor Yellow "Restart started"
                    }
                    else {
                        Write-Host "Possible dry-run?"
                    }
                }
                else {
                    if ($output1 -like "*NoMatchError*") {
                        Write-Host "Nothing to do, probably already removed; a restart may be required"; Disable-TSM $Hosts
                    }
                    else {}
                }
            }
        }
        else {
            Write-Host "No switch parameter found, use -Remove or -Status"; Disable-TSM $Hosts
        }
    }
    catch {throw}
}

Switches

-Status : checks whether the HP-AMS agent is installed on the host and reports the host model.

-Remove : checks whether the host is in maintenance mode, stops the HP-AMS service, uninstalls the HP-AMS VIB and restarts the VMhost.

Execution

Example for a DL585G5

get-hp -VMhostName esx1.net -Status
Please remove HP-AMS Provider found on esx1.net ProLiant DL585 G5 Version: 500.10.0.0-18.434156

Example for a DL380 Gen8

get-hp -VMhostName esx2.net -Status
HP-AMS Provider found on esx2.net ProLiant DL380p Gen8 Version: 500.10.0.0-18.434156

# Remove execution (output not pasted here)
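The output of the removal run was never pasted here; the call itself is simply the same function with the -Remove switch (the host name is an example):

get-hp -VMhostName esx1.net -Remove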

ESXi CPU Status Demand/Usage/Ready?


Demand: the amount of CPU the virtual machine is demanding / trying to use.
Usage: the amount of CPU the virtual machine is currently being allowed to use.
Ready: the amount of time the virtual machine is ready to run but unable to, because vSphere could not find physical CPU resources to run it on.

Virtual machines can be in any one of four high-level CPU States:
Wait: This can occur when the virtual machine’s guest OS is idle (Waiting for Work), or the virtual machine could be waiting on vSphere tasks. Some examples of vSphere tasks that a vCPU may be waiting on are either waiting for I/O to complete (Blocked) or waiting for ESX level swapping to complete (SWPWT). These non-idle vSphere system waits are called VMWAIT.
Ready (RDY): A vCPU is in the Ready state when the virtual machine is ready to run but unable to run because the vSphere scheduler is unable to find physical host CPU resources to run the virtual machine on. One potential reason for elevated Ready time is that the virtual machine is constrained by a user-set CPU limit or resource pool limit, reported as max limited (MLMTD).
CoStop(CSTP): Time the vCPUs of a multi-way virtual machine spent waiting to be co-started. This gives an indication of the co-scheduling overhead incurred by the virtual machine.
Run: Time the virtual machine was running on a physical processor
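To put some numbers behind these definitions, here is a minimal PowerCLI sketch (the VM name is an example) that pulls the realtime CPU ready and usage counters for a VM. cpu.ready.summation is reported in milliseconds per 20-second realtime sample, so ready per vCPU is roughly (value / 20000) * 100 percent.

# pull realtime CPU ready and usage samples for one VM
Get-Stat -Entity (Get-VM testvm01) -Stat cpu.ready.summation, cpu.usagemhz.average -Realtime -MaxSamples 10 |
    Select-Object MetricId, Timestamp, Value, Unit, Instance | Format-Table -AutoSize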

ESXi Setting syslog and firewall settings with PowerCLI

Syslog and firewall configuration with PowerCLI

Setting PowerCLI

Due to the arrival of a SIEM solution I needed to reconfigure the ESXi hosts to point not only to our Kiwi Syslog server, but also to the new appliance. A good job for some PowerCLI.

I had some trouble using Set-VMHostSysLogServer, as it didn't seem to work as expected. It worked on two hosts which had no syslog configured yet, but somehow I couldn't set the others to $null or to the new value, which was very strange. But I don't give up easily, and found the Set-VMHostAdvancedConfiguration cmdlet as another way to set the syslog values.

get-vmhost| Set-VMHostAdvancedConfiguration -NameValue @{'Syslog.global.logHost'='syslog'} -confirm:$false

While testing I noted the message:

This cmdlet is deprecated. Use New-AdvancedSetting, Set-AdvancedSetting, or Remove-AdvancedSetting instead.

Mmm let’s have a look here:

get-vmhost|select -first 1|get-advancedsetting -Name syslog* |select name,value|Ft -a

Name                         Value
----                         -----
Syslog.Remote.Port           514
Syslog.Remote.Hostname       syslog
Syslog.Local.DatastorePath   [] /vmfs/volumes/4dd2476c-etc.

Let’s try to set it

Get-AdvancedSetting -Entity (get-vmhost|select -first 1) -Name Syslog.Remote.Hostname|Set-AdvancedSetting -Value syslog -confirm:$false

You also can set multiple values like:

Get-AdvancedSetting -Entity (get-vmhost|select -first 1) -Name Syslog.Remote.Hostname|Set-AdvancedSetting -Value syslog1,syslog2 -confirm:$false

After setting the proper syslog values it was necessary to open the syslog firewall ports on ESXi. To do this on all hosts at once, it can easily be done with the one-liner below using the Get-VMHostFirewallException cmdlet:

Get-VMHostFirewallException -VMHost (get-vmhost) -Name syslog|Set-VMHostFirewallException -Enabled:$True -Confirm:$false
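For the record, a hedged sketch that combines both steps for all connected hosts (the syslog target names are placeholders; on ESXi 4.x the advanced setting is Syslog.Remote.Hostname instead of Syslog.global.logHost):

foreach ($esx in Get-VMHost) {
    # point the host at both syslog targets (comma-separated string, hostnames are examples)
    Get-AdvancedSetting -Entity $esx -Name Syslog.global.logHost |
        Set-AdvancedSetting -Value "udp://kiwi.local:514,udp://siem.local:514" -Confirm:$false
    # open the outgoing syslog firewall ruleset
    Get-VMHostFirewallException -VMHost $esx -Name syslog |
        Set-VMHostFirewallException -Enabled:$true -Confirm:$false
}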

Locked files ESXi5.1

Locked files 101

Locked files ?

For a few months now we have been seeing more and more "consolidation needed" messages in our vCenter environment, and we are still searching for the cause. Of course we used some VMware articles to dig a bit deeper, but for a successful consolidation we first need to remove the file lock. I put together a little procedure that worked best for us. The information can also be found in the VMware KBs, but it is sometimes a bit confusing because different versions take different approaches.

Investigating virtual machine file locks on ESXi/ESX (10051)
Unable to delete the virtual machine snapshot due to locked files (2017072)
Unable to perform operations on a virtual machine with a locked disk (1003397)
Investigating hosted virtual machine lock files (1003857)

Find locked files

It's easiest to power off the VM, so you are sure there shouldn't be any lock left. If you're not sure, or the VM has a lot of disks, it can be a b*tch to check for locks using vmkfstools -D. It's doable, but I found an easier way using the touch command.

Start an SSH session to one of the ESXi hosts in the cluster where the VM you want to investigate resides.

Move to the location of the VM on the datastore: # cd /vmfs/volumes/<datastore>/<VMname>
Now use the touch * command to change the timestamp of all files in this directory. If a file is locked you will get output like:

touch: VM1-flat.vmdk: Device or resource busy
touch: VM1_3-flat.vmdk: Device or resource busy

Who owns me?

We now know that the two files above are locked in some way and cannot be written to. Next we can use the vmkfstools command to see which ESXi host owns the lock.

# vmkfstools -D VM1-flat.vmdk
Lock [type 10c00001 offset 13957120 v 448, hb offset 3510272
gen 41, mode 2, owner 00000000-00000000-0000-000000000000 mtime 845230
num 1 gblnum 0 gblgen 0 gblbrk 0]

We now see two things. The first is mode 2, which indicates the kind of lock. The list below shows the possible modes:

  • mode 0 = no lock
  • mode 1 = is an exclusive lock (vmx file of a powered on VM, the currently used disk (flat or delta), *vswp, etc.)
  • mode 2 = is a read-only lock (e.g. on the ..-flat.vmdk of a running VM with snapshots)
  • mode 3 = is a multi-writer lock (e.g. used for MSCS clusters disks or FT VMs)

We see mode 2, which indicates a read-only lock. Normally there is also an owner, but here it shows only zeroes, so we can't use that information.

In the few lines below we see this information about the Read Only Owner:


RO Owner[0] HB Offset 3510272 53baaace-0e95cf36-46a4-441ea15dea84

Addr <4, 5, 119>, gen 269, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 34359738368, nb 32768 tbz 7056, cow 0, newSinceEpoch 32768, zla 3, bs 1048576

The read only owner of this lock is MAC address 441ea15dea84 which converts to 44:1e:a1:5d:ea:84.

Find the MAC

A fast and easy way to find the corresponding MAC address is with powerCLI. Because we know the cluster where it happens I came up with the command line below:

get-cluster *001*|Get-VMHost|Get-VMHostNetworkAdapter|where {$_.Mac -eq "44:1e:a1:5d:ea:84"}|select vmhost,mac

VMHost                    Mac
------                    ---
ESX.host.net              44:1e:a1:5d:ea:84

Now we put this host in maintenance mode and reboot it. Normally the file lock should be gone afterwards and we can consolidate the VM.
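After the reboot, the consolidation itself can also be kicked off from PowerCLI. A hedged sketch (the VM name is an example), using the ConsolidateVMDisks method exposed through ExtensionData:

$vm = Get-VM VM1
if ($vm.ExtensionData.Runtime.ConsolidationNeeded) {
    # same as "Snapshot > Consolidate" in the vSphere Client
    $vm.ExtensionData.ConsolidateVMDisks()
}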

Warning: Failed to connect to the agentx master agent (hpilo:)


Noticing

While doing my administrative tasks I was wandering through the syslogs, and I noticed a system returning the message below every few minutes.

hpHelper[14674]: Warning: Failed to connect to the agentx master agent (hpilo:)

Mmm, let's try to reboot the system; that was no solution. I compared the iLO settings of this system with those of a similar system (both HP DL380 Gen8) and saw some differences. The system without the problem was set to "agentless management", while the other had some SNMP settings configured. I removed the SNMP settings and set it to agentless, like the host without problems.

After that I restarted all the services to make sure the connection would be reset. Too bad that didn't work either.

Solution

With a little searching I found this site.
That triggered me: it might not be a problem on the ESXi side, but on the iLO side. So I reset the iLO board, which reinitialized the connection. The AgentX messages disappeared and the problem seems to be solved.

 

Could not connect using the requested protocol PowerCLI

Problem

Suddenly (during a failover test) I noticed a script didn't work anymore. The script connects directly to an ESX host, and I got the error below. Somehow I had never seen this before, even though we have used this script for a long time. So something changed, but when and why this happened was still on my research list. After a Google search for the "requested protocol" error I found the solution.

Connect-VIServer : 22-3-2014 9:03:07 Connect-VIServer Could not connect using the requested protocol.
At C:\Users\ee34028.adm\Documents\WindowsPowerShell\Microsoft.PowerShell_profile.ps1:361 char:42
+ $currentConnection = Connect-VIServer <<<< $vcHost -Credential $cred
 + CategoryInfo : ObjectNotFound: (:) [Connect-VIServer], ViServerConnectionException
 + FullyQualifiedErrorId : Client20_ConnectivityServiceImpl_Reconnect_ProtocolError,VMware.VimAutomation.ViCore.Cmdlets.Commands.ConnectVIServer

Solution

To get rid of this problem there is a solution that sets the PowerCLI proxy policy to "NoProxy".

The solution is described in:
Connecting to the AutoDeploy server fails with the error: Could not connect using the requested protocol (2011395)

Open your PowerCLI console as administrator (otherwise you don't have sufficient rights to edit this setting).
To show the current configuration, use Get-PowerCLIConfiguration:

C:\>Get-PowerCLIConfiguration

Proxy Policy     Default Server Mode
------------     -------------------
UseSystemProxy   Single

As you can see it uses the system proxy. To change this, use Set-PowerCLIConfiguration -ProxyPolicy NoProxy -Confirm to set the proxy policy to NoProxy. There are two proxy policies to choose from: UseSystemProxy and NoProxy.

C:\>Set-PowerCLIConfiguration -ProxyPolicy NoProxy -Confirm
Perform operation?
Performing operation 'Update vSphere PowerCLI configuration.'?
[Y] Yes [A] Yes to All [N] No [L] No to All [S] Suspend [?] Help (default is "Y"): y
Proxy Policy     Default Server Mode
------------     -------------------
NoProxy          Single

VMware : Converting IDE disks to SCSI

After migrating the Linux environment from KVM to ESX (see my previous post on how to do it), we noticed that the disks
were connected as IDE disks. Therefore it wasn't possible to (dynamically) resize them, or to add more disks than the 4 IDE slots allow (including the CD-ROM).

It is pretty easy to convert these to SCSI disks, but it will require downtime.
See also the VMware post about this:
Converting a virtual IDE disk to a virtual SCSI disk (1016192)

For Windows machines it is recommended to repair the MBR of the disk, as advised in the article above.
When encountering problems you could have a look at:
Repairing boot sector problems in Windows NT-based operating systems (1006556)

Luckily we tested it a few times in the Linux environment without encountering problems (all VMs are RedHat 6.4 or higher).

1) Turn off the VM
2) Locate the ESX host from the VM
3) Locate the datastores of the disks to edit
4) Turn on SSH on the ESX
5) Connect using SSH and go to the VM folder

# cd /vmfs/volumes/<datastore_name>/<vm_name>/

 

Now open the VMDK descriptor file using an editor such as vi or nano. For more information about vi/nano, see:
Editing files on an ESX host using vi or nano (1020302)
*Note: nano is not available in ESXi, but it can be installed manually.

6) In this case we edit the TEST_PAT.vmdk file

# vi TEST_PAT.vmdk

 

When you look at the file you will see a line ddb.adapterType = "ide"; this is the value ESX uses to determine which adapter to use. In this case, when you add the VMDK using "Add new disk -> Use existing disk", it will see IDE and add an IDE adapter.

So we need to change this value.

Specify one of these values: lsilogic or buslogic.

This table shows the adapter type for the guest operating system:

Guest Operating System             Adapter Type
Windows 2003, 2008, Vista          lsilogic
Windows NT, 2000, XP               buslogic
Linux                              lsilogic

In this case we chose lsilogic.

Change the adapter type from ide to lsilogic and save the file.

Next go back to your virtual machine and remove the disks you edited (don't remove them from your storage!), so wisely choose "Remove from Virtual Machine".
It's important not to remove the disk before you start editing, because the VMDK descriptor file is not available for editing if the disk is not connected to a VM.

Apply the settings. Now go back to Edit Settings -> Add -> Hard Disk -> Use an existing virtual disk -> browse to the location of the disk file and click Next a few times.
As you will notice, the disk adapter is now displayed as SCSI.
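If you want to double-check from PowerCLI, a small hedged sketch (the VM name is an example):

# the controller should now show up as VirtualLsiLogic (or VirtualBusLogic) instead of IDE
Get-VM TEST_PAT | Get-ScsiController | Select-Object Name, Type
Get-VM TEST_PAT | Get-HardDisk | Select-Object Name, Filename, CapacityGB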

Now you have added your SCSI disk.

That's it!

Auto-update VMware tools installation fails?

Somehow we noticed a few machines where it was not possible to auto-update the VMware Tools from vCenter. After a time-out of 30 minutes the task failed. The result: no VMware Tools installed, a task that failed according to vCenter, and a task that keeps lingering on the ESXi host/VM.

We had never encountered this issue before, and after looking through log files, Event Viewer, etc. I didn't find a proper explanation. Then I suddenly had a clear moment and thought about looking at the advanced settings of the VM, because a while ago I changed the template settings as part of our yearly hassle with the security baseline 🙂
That's easy with PowerCLI:

get-vm <vm>|Get-AdvancedSetting

Mmz I found a tools setting:

Name: isolation.tools.autoinstall.disable
Value:true

This simply means that auto-install is disabled. That explains why it won't work automatically, while we never had an issue installing manually.

Hah, this can also easily be changed with PowerCLI.

get-vm <VM>|Get-AdvancedSetting -Name isolation.tools.autoinstall.disable|Set-AdvancedSetting -Value:$false -confirm:$false

DAMN!

I threw this script at our test cluster, but somehow not all machines could be edited. A little investigation led me to the fact that the machines which failed had a timed-out "Install VMware Tools" task.

So what about killing the task? It was not possible through vCenter, and restarting the management agents on the host wasn't a success either.

Then I came across this VMware article:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1038542

  1. Run this command to identify the virtual machine ID:
    # vim-cmd vmsvc/getallvms 

    You see an output similar to:

    2256   testvm          [datastore] testvm/testvm.vmx              winNetStandardGuest       vmx-07    
  2. Note the ID of the virtual machine with the ongoing VMware Tools installation. In this example, the virtual machine ID is 2256.
  3. Run this command to stop the VMware Tools installation:
    # vim-cmd vmsvc/tools.cancelinstall <vm_ID>

    Where <vm_ID> is the ID of the virtual machine noted in step 2.

Now the Tools upgrade task is killed, and it is possible again to vMotion, change advanced settings, etc.
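As a side note, such lingering tasks can at least be spotted from PowerCLI; a hedged sketch (cancelling a stuck Tools install still required vim-cmd in our case):

# list running tasks known to vCenter; a stuck Tools install usually shows up here
Get-Task | Where-Object {$_.State -eq "Running"} | Select-Object Name, State, PercentComplete, StartTime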

So now it’s possible to edit the settings. I started the script again and it went flawlessly. Next step is to try the VMware tools installation again.

I picked a machine, started "Install VMware Tools" through vCenter and waited... after 30 minutes, again: TIME-OUT!
Hmm, maybe we need to power the VM off and on for it to pick up the new setting. So again: cancelled the task, checked the setting. All OK.

Shut down the guest, power it on again and reinstall VMware Tools through vCenter: this seems to be the magical combination.

To wrap it up:

– Cancel any running tasks on the ESX host/VM
– Change the isolation.tools.autoinstall.disable setting
– Power off/on the VM
– Reinstall VMware tools

Best is of course to change the setting while the machine is powered off.

Or just do a manual install 🙂
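For the record, a hedged PowerCLI sketch of that order of operations (the VM name is an example; it assumes the stuck task has already been cancelled):

$vm = Get-VM testvm01
Shutdown-VMGuest -VM $vm -Confirm:$false                                        # graceful guest shutdown
while ((Get-VM -Name $vm.Name).PowerState -ne "PoweredOff") { Start-Sleep 10 }  # wait until powered off
Get-AdvancedSetting -Entity $vm -Name isolation.tools.autoinstall.disable |
    Set-AdvancedSetting -Value:$false -Confirm:$false                           # allow auto-install again
Start-VM -VM $vm | Out-Null
Update-Tools -VM $vm -NoReboot                                                  # push the Tools upgrade from vCenter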

Last but not least I created a little script that retrieves the key and, if the key exists and its value is not already "false", sets it to false.


foreach ($VM in Get-Cluster *199* | Get-VM) {
    $Setting = $VM | Get-AdvancedSetting -Name isolation.tools.autoinstall.disable | Select Name, Value
    if ($Setting -ne $null) {
        if ($Setting.Value -ne "false") {
            $VM | Get-AdvancedSetting -Name isolation.tools.autoinstall.disable | Set-AdvancedSetting -Value:$false -Confirm:$false
        }
    }
}

Root disk full WTF?


Suddenly we somehow had a virtual machine which couldn't be powered on; the vCenter events showed messages like:

Power On virtual machine
<Computer>
A general system error occurred: Unknown error

Mmm, strange, but I had seen this before. I used the KB articles mentioned in each part below to find the solution.

Part 1

I started to look at the /var/spool/snmp directory, but this one was empty so it could not be the cause of filling the /root disk.

  1. Connect to the ESXi host using SSH. For more information, see Using ESXi Shell in ESXi 5.0 and 5.1 (2004746).
  2. Check if SNMP is creating too many .trp files in the /var/spool/snmp directory on the ESXi host by running the command:

    ls /var/spool/snmp | wc -l

    Note: If the output indicates that the value is 2000 or more, this may be causing the full inodes.
  3. Delete the .trp files in the /var/spool/snmp/ directory by running the commands:

    # cd /var/spool/snmp
    # for i in $(ls | grep trp); do rm -f $i;done

  4. Change directory to /etc/vmware/ and back up the snmp.xml file by running the commands:

    # cd /etc/vmware
    # mv snmp.xml snmp.xml.bkup

  5. Create a new file named snmp.xml and open it using a text editor. For more information, see Editing files on an ESX host using vi or nano (1020302).
  6. Copy and paste these contents to the file:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <config>
    <snmpSettings><enable>false</enable><port>161</port><syscontact></syscontact><syslocation></syslocation>
    <EnvEventSource>indications</EnvEventSource><communities></communities><loglevel>info</loglevel><authProtocol></authProtocol><privProtocol></privProtocol></snmpSettings>
    </config>

  7. Save and close the file.
  8. Reconfigure SNMP on the affected host by running the command:

    # esxcli system snmp set --enable=true

  9. To confirm the SNMP services are running normally again, run the command:

    # esxcli system snmp get

    Here is an example of the output:

    /etc/vmware # esxcli system snmp get
    Authentication: Communities: Enable: true Engineid: 00000063000000a10a0121cf Hwsrc: indications Loglevel: info Notraps: Port: 161 Privacy: Remoteusers: Syscontact: Syslocation: Targets: Users: V3targets:

But to be sure I disabled SNMP as mentioned in the article:
ESXi 5.1 host becomes unresponsive when attempting a vMotion migration or a configuration change (2040707)
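As a side note, the same check can be done from PowerCLI without opening SSH, through the esxcli bindings of Get-EsxCli; a hedged sketch:

foreach ($esx in Get-VMHost) {
    $esxcli = Get-EsxCli -VMHost $esx
    $snmp = $esxcli.system.snmp.get()        # equivalent of "esxcli system snmp get" on the host
    "{0}: SNMP enabled = {1}" -f $esx.Name, $snmp.enable
}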

Part 2

The first article also hints that HP Gen8 hardware can have a similar issue with a different resolution. HAH! We use that hardware, so I went on to remove the hpHelper.log file, using the steps below.

To remove the hpHelper.log file:

  1. Log in to the ESXi host as the root user.
  2. Stop the HP Helper management agent by running the command:

    /etc/init.d/hp-ams.sh stop

  3. To remove the hpHelper.log file, run the command:

    rm /var/log/hpHelper.log

  4. To restart the HP Helper management agent, run the command:

    /etc/init.d/hp-ams.sh start

ESXi ramdisk full due to /var/log/hpHelper.log file size (2055924)

After I started everything again and noticed the hpHelper.log file was still very small, I tried again. DAMN! Same error. I guessed there was something else filling the root disk, so my search continued.

Part 3

Somehow I had a clear moment and thought about the sfcbd-watchdog service. After searching for sfcbd and root disk full, I found this article:

ESXi 5.x host is disconnected from vCenter Server due to sfcbd exhausting inodes (2037798)

With the help of the command vdf -h I noticed that there was only 8 MB Free on /root

Ramdisk        Size     Used   Available
root            32M      24M          8M
etc             28M     224K         27M
tmp            192M      40K        191M
hostdstats     834M       3M        830M

Mmm that’s not much.

To free up inodes:

  1. Connect to the ESXi host using SSH.
  2. To stop the sfcbd service, run the command:

    /etc/init.d/sfcbd-watchdog stop

  3. Manually delete the files in the /var/run/sfcb directory to free inodes. To remove the files, run the commands:

    cd /var/run/sfcb
    rm [0-2]*
    rm [3-6]*
    rm [7-9]*
    rm [a-c]*
    rm [d-f]*

  4. To restart the sfcbd service, run the command:

    /etc/init.d/sfcbd-watchdog start

    Note: You may need to restart management agents after step 4. For more information, see Restarting the Management agents on an ESXi or ESX host (1003490).

After doing this I had like 30MB free on /root.

This issue should be resolved by a later ESXi update. Unfortunately it was not possible for us to install the latest patches on these hosts beforehand.

To manually resolve this issue, you can prevent the root file system from being impacted

  1. Connect to the ESXi host using SSH.
  2. Run this command to stop the sfcbd service:

    /etc/init.d/sfcbd-watchdog stop

  3. Enable this command to be run at boot time:

    esxcli system visorfs ramdisk add --name sfcbtickets --min-size 0 --max-size 1024 --permissions 0755 --target /var/run/sfcb

    For more information, see Modifying the rc.local or local.sh file in ESX/ESXi to execute commands while booting (2043564).
  4. Run this command to start the sfcbd service:

    /etc/init.d/sfcbd-watchdog start

  5. Run this command and verify if the new ramdisk exists:

    esxcli system visorfs ramdisk list

Note: This issue may also occur if the SNMP agent is enabled. For more information, see vCenter Server and vmkernel logs report the message: The ramdisk 'root' is full (2033073).
After doing this I was able to successfully start up the virtual machine.
Now it's time to make sure we do the updates ASAP!
Another handy article to gather file system information: