Locked files ESXi5.1

Locked files 101

Locked files ?

For a few months now we have been seeing more and more “consolidation needed” messages in our vCenter environment, and we’re still searching for the cause. Of course we used some VMware articles to dig a bit deeper. For a successful consolidation, however, we first need to remove the file lock. I put together a little procedure that worked best for us. The information can also be found in the VMware KBs, but it can be a bit confusing because the approach differs between versions.

Investigating virtual machine file locks on ESXi/ESX (10051)
Unable to delete the virtual machine snapshot due to locked files (2017072)
Unable to perform operations on a virtual machine with a locked disk (1003397)
Investigating hosted virtual machine lock files (1003857)

Find locked files

It’s easiest to power off the VM, so you can be sure there shouldn’t be any lock left. If you’re not sure, or the VM has a lot of disks, it can be a b*tch to check every file for locks using vmkfstools -D. It’s doable, but I found an easier way using the touch command.

Start an SSH session to one of the ESXi hosts in the cluster where the VM you want to investigate resides.

Move to the location of the VM on the datastore: # cd /vmfs/volumes/<datastore>/<VMname>
Now use the touch * command to change the timestamp of all files in this directory. If a file is locked you will get output like:

touch: VM1-flat.vmdk: Device or resource busy
touch: VM1_3-flat.vmdk: Device or resource busy
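
The touch check can also be wrapped in a small loop that prints only the files that refuse the timestamp update. This is a minimal sketch, not something taken from the VMware KBs: the function name list_locked_files is made up for illustration, and it assumes that a failing touch means a lock (it could also be a permissions problem).

```shell
# Hypothetical helper: print every file in a directory that touch cannot update.
# Note: like 'touch *', this changes the timestamp of every file that is NOT locked.
list_locked_files() {
    dir="${1:-.}"
    for f in "$dir"/*; do
        # Skip anything that is not a regular file (e.g. subdirectories).
        [ -f "$f" ] || continue
        # touch fails with "Device or resource busy" on a locked file.
        touch "$f" 2>/dev/null || echo "$f"
    done
}

# Usage on the ESXi host (path is a placeholder):
# list_locked_files /vmfs/volumes/<datastore>/<VMname>
```

The output is one path per line, which makes it easy to feed each file to vmkfstools -D afterwards.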

Who owns me?

We now know that the 2 files above are locked in some way and cannot be written to. Now we can move on to the vmkfstools command to see which ESXi host owns the lock.

# vmkfstools -D VM1-flat.vmdk
Lock [type 10c00001 offset 13957120 v 448, hb offset 3510272
gen 41, mode 2, owner 00000000-00000000-0000-000000000000 mtime 845230
num 1 gblnum 0 gblgen 0 gblbrk 0]

We now see two things. The first is mode 2, which indicates the kind of lock. The possible modes are listed below:

  • mode 0 = no lock
  • mode 1 = is an exclusive lock (vmx file of a powered on VM, the currently used disk (flat or delta), *vswp, etc.)
  • mode 2 = is a read-only lock (e.g. on the ..-flat.vmdk of a running VM with snapshots)
  • mode 3 = is a multi-writer lock (e.g. used for MSCS clusters disks or FT VMs)

Mode 2 means a read-only lock. Normally there is also an owner, but here it shows only zeroes, so we can’t use this information.

A few lines further down in the same output we see this information about the read-only owner:
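
Reading the mode from the output can be scripted when there are many files to check. The sketch below is only an illustration: lock_mode is a hypothetical helper name, and it assumes the Lock line has the same "gen N, mode N, owner ..." layout as the output shown above.

```shell
# Hypothetical helper: read 'vmkfstools -D' output on stdin and describe the lock mode.
lock_mode() {
    # Pull the number that follows "mode" on the "gen ..., mode N, owner ..." line.
    # The file permission "mode 600" on the Addr line does not match this pattern.
    mode=$(sed -n 's/.*gen [0-9]*, mode \([0-9]*\),.*/\1/p' | head -n 1)
    case "$mode" in
        0) echo "mode 0: no lock" ;;
        1) echo "mode 1: exclusive lock" ;;
        2) echo "mode 2: read-only lock" ;;
        3) echo "mode 3: multi-writer lock" ;;
        *) echo "mode not found" ;;
    esac
}

# Usage on the ESXi host (2>&1 in case the output goes to stderr):
# vmkfstools -D VM1-flat.vmdk 2>&1 | lock_mode
```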


RO Owner[0] HB Offset 3510272 53baaace-0e95cf36-46a4-441ea15dea84

Addr <4, 5, 119>, gen 269, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 34359738368, nb 32768 tbz 7056, cow 0, newSinceEpoch 32768, zla 3, bs 1048576

The last part of the read-only owner ID is the MAC address of the owning host: 441ea15dea84, which converts to 44:1e:a1:5d:ea:84.
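
The conversion from owner ID to MAC address is easy to get wrong by hand, so here is a minimal sketch of it. The function name owner_to_mac is hypothetical; the only assumption is that the MAC address is the part after the last dash, as in the output above.

```shell
# Hypothetical helper: turn the last segment of a VMFS lock owner ID into a MAC address.
owner_to_mac() {
    # Keep only what follows the final dash, e.g. 441ea15dea84.
    tail12="${1##*-}"
    # Insert a colon after every two hex digits, then drop the trailing colon.
    echo "$tail12" | sed 's/../&:/g; s/:$//'
}

# owner_to_mac 53baaace-0e95cf36-46a4-441ea15dea84
# prints 44:1e:a1:5d:ea:84
```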

Find the MAC

A fast and easy way to find the host with this MAC address is PowerCLI. Because we know the cluster where it happens, I came up with the command line below:

Get-Cluster *001* | Get-VMHost | Get-VMHostNetworkAdapter | Where-Object {$_.Mac -eq "44:1e:a1:5d:ea:84"} | Select-Object VMHost, Mac

VMHost        Mac
------        ---
ESX.host.net  44:1e:a1:5d:ea:84

Now we put this host in maintenance mode and reboot it. Normally the file lock should be gone after that and we can consolidate the VM.

Warning: Failed to connect to the agentx master agent (hpilo:)

Noticing

While doing my administrative tasks, I was wandering through the syslogs and noticed a system logging the message below every few minutes.

hpHelper[14674]: Warning: Failed to connect to the agentx master agent (hpilo:)

Mmmz, let’s try rebooting the system. That was no solution. I compared the iLO settings of this system with those of a similar system (both HP DL380 Gen8) and saw some differences. The system without the problem was set to “agentless management”; the other had some SNMP settings. I removed the SNMP settings and set it to agentless, like the host without problems.

After that I restarted all the services to make sure the connection would be reset. Too bad, it didn’t work either.

Solution

With a little search I found this site.
This triggered me: the problem might not be on the ESX side, but on the iLO side. So I reset the iLO board, which reinitialized the connection. The AgentX messages disappeared and the problem seems to be solved.

 

Could not connect using the requested protocol PowerCLI

Problem

Suddenly (during a failover test) I noticed a script didn’t work anymore. The script connects directly to an ESX host, and I got the error below. I had never seen this before, and we had used this script for a long time. So something had changed, but when and why was still on my research list. After a Google search for the “requested protocol” error I found the solution.

Connect-VIServer : 22-3-2014 9:03:07 Connect-VIServer Could not connect using the requested protocol.
At C:\Users\ee34028.adm\Documents\WindowsPowerShell\Microsoft.PowerShell_profile.ps1:361 char:42
+ $currentConnection = Connect-VIServer <<<< $vcHost -Credential $cred
 + CategoryInfo : ObjectNotFound: (:) [Connect-VIServer], ViServerConnectionException
 + FullyQualifiedErrorId : Client20_ConnectivityServiceImpl_Reconnect_ProtocolError,VMware.VimAutomation.ViCore.Cmdlets.Commands.ConnectVIServer

Solution

To get rid of this problem there is a solution that sets the PowerCLI configuration to “Use no proxy”.

The solution is described in:
Connecting to the AutoDeploy server fails with the error: Could not connect using the requested protocol (2011395)

Open your PowerCLI console as administrator (otherwise you don’t have sufficient rights to edit this setting).
To show the current configuration use “Get-PowerCLIConfiguration”:

C:\>Get-PowerCLIConfiguration

Proxy Policy    Default Server Mode
------------    -------------------
UseSystemProxy  Single

As you can see, it uses the system’s proxy. To change this, use “Set-PowerCLIConfiguration -ProxyPolicy NoProxy -Confirm” to set the proxy policy to “NoProxy”.
You can choose between two proxy policies: NoProxy and UseSystemProxy.

C:\>Set-PowerCLIConfiguration -ProxyPolicy NoProxy -Confirm
Perform operation?
Performing operation 'Update vSphere PowerCLI configuration.'?
[Y] Yes [A] Yes to All [N] No [L] No to All [S] Suspend [?] Help (default is "Y"): y
Proxy Policy    Default Server Mode
------------    -------------------
NoProxy         Single

VMware : Converting IDE disks to SCSI

After migrating the Linux environment from KVM to ESX (see my previous post on how to do it), we noticed that the disks were connected as IDE disks. Therefore it wasn’t possible to resize them dynamically or to add more disks than the 4 IDE slots allow (including the CD-ROM).

It’s pretty easy to convert these to SCSI disks, but it does require downtime.
See also the VMware post about this:
Converting a virtual IDE disk to a virtual SCSI disk (1016192)

For Windows machines it’s recommended to repair the MBR of the disk, as advised in the article above.
If you encounter problems, have a look at:
Repairing boot sector problems in Windows NT-based operating systems (1006556)

Luckily we tested it a few times in the Linux environment without encountering problems (all VMs are Red Hat 6.4 or higher).

1) Turn off the VM
2) Locate the ESX host from the VM
3) Locate the datastores of the disks to edit
4) Turn on SSH on the ESX
5) Connect using SSH and go to the VM folder

# cd /vmfs/volumes/<datastore_name>/<vm_name>/

 

Now open the VMDK descriptor file using a text editor such as vi or nano. For more information about vi/nano, see:
Editing files on an ESX host using vi or nano (1020302)
*Note: nano is not available in ESXi by default, but it can be installed manually.

6) In this case we edit the TEST_PAT.vmdk file

# vi TEST_PAT.vmdk

 

When you look at the file you will see a line ddb.adapterType = “ide”. This is the value ESX uses to determine which adapter to use. In this case, when you add the VMDK using “add new disk -> use existing disk”, it will see IDE and add an IDE adapter.

So we need to change this value.

Specify one of these parameters: lsilogic or buslogic.

This table shows the adapter type for the guest operating system:

Guest Operating System       Adapter Type
Windows 2003, 2008, Vista    lsilogic
Windows NT, 2000, XP         buslogic
Linux                        lsilogic

In this case we chose lsilogic.

Change the value from ide to lsilogic and save the file.
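
If you have several disks to convert, the edit can be done without opening an editor. The sketch below is only an illustration under a few assumptions: the function name set_adapter_lsilogic is made up, the adapter line has the form ddb.adapterType = "ide" (in any letter case), and the VM is powered off. It keeps a .bak copy of the original descriptor next to the file.

```shell
# Hypothetical helper: switch ddb.adapterType from ide to lsilogic in a descriptor file.
set_adapter_lsilogic() {
    descriptor="$1"
    # Keep a copy of the original descriptor next to it before changing anything.
    cp "$descriptor" "$descriptor.bak" || return 1
    # Match "ide" in any letter case; read the backup and rewrite the original.
    sed 's/^\(ddb\.adapterType *= *\)"[iI][dD][eE]"/\1"lsilogic"/' \
        "$descriptor.bak" > "$descriptor"
}

# Usage, from inside the VM folder on the datastore:
# set_adapter_lsilogic TEST_PAT.vmdk
```

Run it against the small descriptor file only (TEST_PAT.vmdk in this example), never against the -flat.vmdk data file.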

Next, go back to your virtual machine and remove the disks you edited. Don’t delete them from your storage, so be sure to choose “Remove from Virtual Machine”.
It’s important not to remove the disk before you start editing, because the VMDK descriptor file doesn’t exist yet if the disk is not connected to a VM.

Apply the settings. Now go back to Edit Settings -> Add -> Hard Disk -> Use an existing virtual disk, browse to the location of the disk file and click Next a few times.
As you will notice, the disk adapter is now displayed as SCSI.

You have now added your SCSI disk.

That’s it!

Auto-update VMware tools installation fails?

Somehow we noticed a few machines where it was not possible to auto-update the VMware Tools from vCenter. After a time-out of 30 minutes the task failed. The result: no VMware Tools installed and a task that failed according to vCenter, while a task keeps lingering on the ESXi host/VM.

We had never encountered this issue before. After looking through log files, eventvwr, etc. I didn’t find a proper explanation. Then I suddenly had a clear moment and thought about looking in the advanced settings of the VM, because a while ago I had changed the template settings as part of our yearly hassle with the security baseline 🙂
Easy with PowerCLI:

get-vm <vm>|Get-AdvancedSetting

Mmz I found a tools setting:

Name: isolation.tools.autoinstall.disable
Value:true

This just means that the auto-install is disabled. That explains why it won’t work automatically, and manually we didn’t encounter any issue.

Hah, this can easily be edited with PowerCLI as well.

get-vm <VM>|Get-AdvancedSetting -Name isolation.tools.autoinstall.disable|Set-AdvancedSetting -Value:$false -confirm:$false

DAMN!

I threw this script at our test cluster, but somehow not all machines could be edited. A little investigation led me to the fact that the machines which failed had a “Time-out VMware tools install task”.

So what about killing the task? It was not possible through vCenter. I tried restarting the management agents on the host, but that wasn’t a success either.

Then I came across this VMware article:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1038542

  1. Run this command to identify the virtual machine ID:
    # vim-cmd vmsvc/getallvms 

    You see an output similar to:

    2256   testvm          [datastore] testvm/testvm.vmx              winNetStandardGuest       vmx-07    
  2. Note the ID of the virtual machine with the ongoing VMware Tools installation. In this example, the virtual machine ID is 2256.
  3. Run this command to stop the VMware Tools installation:
    # vim-cmd vmsvc/tools.cancelinstall <vm_ID>

    Where <vm_ID> is the ID of the virtual machine noted in Step 2.

Now that the Tools upgrade task is killed, it is possible again to vMotion, change advanced settings, etc.
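
On a host with many VMs, looking up the ID by eye gets tedious. The lookup from the steps above can be sketched as a small filter. The function name vmid_by_name is hypothetical, and it assumes the VM name contains no spaces, because it simply compares the second whitespace-separated column.

```shell
# Hypothetical helper: read 'vim-cmd vmsvc/getallvms' output on stdin, print the ID of one VM.
vmid_by_name() {
    # Column 1 is the Vmid, column 2 the name (assumes the name has no spaces).
    awk -v name="$1" '$2 == name { print $1 }'
}

# Usage on the ESXi host (testvm is the example name from the KB output):
# id=$(vim-cmd vmsvc/getallvms | vmid_by_name testvm)
# vim-cmd vmsvc/tools.cancelinstall "$id"
```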

So now it’s possible to edit the settings. I started the script again and it went flawlessly. The next step is to try the VMware Tools installation again.

I picked a machine, started the VMware Tools install through vCenter and waited… after 30 minutes again: TIME-OUT!
Mmmmm, maybe we need to power off and power on the VM to pick up the new setting. So again: cancelled the task, checked the setting. All OK.

Shut down the guest, power it on and reinstall VMware Tools through vCenter: this seems to be the magical combination.

To wrap it up:

– Cancel any running tasks on the ESX host/VM
– Change the isolation.tools.autoinstall.disable setting
– Power off/on the VM
– Reinstall VMware tools

Best is of course to change the setting while the machine is powered off.

Or just do a manual install 🙂

Last but not least I created a little script that retrieves the key and, if the key exists and the value is not “false”, sets it to false:


# Loop over all VMs in the matching cluster(s)
foreach ($VM in Get-Cluster *199* | Get-VM) {
    $Setting = $VM | Get-AdvancedSetting -Name isolation.tools.autoinstall.disable | Select-Object Name, Value
    # Only touch VMs where the key exists and is not already "false"
    if ($Setting -ne $null) {
        if ($Setting.Value -ne "false") {
            $VM | Get-AdvancedSetting -Name isolation.tools.autoinstall.disable | Set-AdvancedSetting -Value:$false -Confirm:$false
        }
    }
}

Root disk full WTF?

Suddenly we had a virtual machine which couldn’t be powered on. The vCenter events showed messages like:

Power On virtual machine
<Computer>
A general system error occurred: Unknown error

Mmm, strange, but I had seen it before. I used the articles mentioned in each section to find the solution.

Part 1

I started by looking at the /var/spool/snmp directory, but it was empty, so it could not be what was filling the root disk.

  1. Connect to the ESXi host using SSH. For more information, see Using ESXi Shell in ESXi 5.0 and 5.1 (2004746).
  2. Check if SNMP is creating too many .trp files in the /var/spool/snmp directory on the ESXi host by running the command:
    ls /var/spool/snmp | wc -l
    Note: If the output indicates that the value is 2000 or more, this may be causing the full inodes.
  3. Delete the .trp files in the /var/spool/snmp/ directory by running the commands:
    # cd /var/spool/snmp
    # for i in $(ls | grep trp); do rm -f $i;done
  4. Change directory to /etc/vmware/ and back up the snmp.xml file by running the commands:
    # cd /etc/vmware
    # mv snmp.xml snmp.xml.bkup
  5. Create a new file named snmp.xml and open it using a text editor. For more information, see Editing files on an ESX host using vi or nano (1020302).
  6. Copy and paste these contents to the file:
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <config>
    <snmpSettings><enable>false</enable><port>161</port><syscontact></syscontact><syslocation></syslocation>
    <EnvEventSource>indications</EnvEventSource><communities></communities><loglevel>info</loglevel><authProtocol></authProtocol><privProtocol></privProtocol></snmpSettings>
    </config>
  7. Save and close the file.
  8. Reconfigure SNMP on the affected host by running the command:
    # esxcli system snmp set --enable=true
  9. To confirm the SNMP services are running normally again, run the command:
    # esxcli system snmp get
    Here is an example of the output:
    /etc/vmware # esxcli system snmp get
    Authentication:
    Communities:
    Enable: true
    Engineid: 00000063000000a10a0121cf
    Hwsrc: indications
    Loglevel: info
    Notraps:
    Port: 161
    Privacy:
    Remoteusers:
    Syscontact:
    Syslocation:
    Targets:
    Users:
    V3targets:

But to be sure I disabled SNMP as described in the article:
ESXi 5.1 host becomes unresponsive when attempting a vMotion migration or a configuration change (2040707)
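
The .trp check from step 2 can be turned into a quick yes/no test, which is handy when you have to go through several hosts. This is a minimal sketch: check_trp_count is a hypothetical name, the default directory and the threshold of 2000 come from the KB steps above, and the message format is my own.

```shell
# Hypothetical helper: count .trp files and warn at or above a threshold (2000 in the KB note).
check_trp_count() {
    dir="${1:-/var/spool/snmp}"
    limit="${2:-2000}"
    # grep -c prints 0 when nothing matches; '|| true' hides its non-zero exit status.
    count=$(ls -A "$dir" 2>/dev/null | grep -c '\.trp$' || true)
    if [ "$count" -ge "$limit" ]; then
        echo "WARNING: $count .trp files in $dir"
    else
        echo "OK: $count .trp files in $dir"
    fi
}

# Usage on the ESXi host:
# check_trp_count
```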

Part 2

The first article also hints that HP Gen8 hardware can have similar issues with a different resolution. HAH! We use that hardware, so I went on to remove the hpHelper.log file with the steps below.

To remove the hpHelper.log file:

  1. Log in to the ESXi host as the root user.
  2. Stop the HP Helper management agent by running the command:
    /etc/init.d/hp-ams.sh stop
  3. To remove the hpHelper.log file, run the command:
    rm /var/log/hpHelper.log
  4. To restart the HP Helper management agent, run the command:
    /etc/init.d/hp-ams.sh start

ESXi ramdisk full due to /var/log/hpHelper.log file size (2055924)

After I had started everything again and noticed that the hpHelper.log file was tiny, I tried again. DAMN! Same error. I guess something else is filling the root disk, so my search continues.

Part 3

Then I had a clear moment and thought about the sfcbd-watchdog service. After a search for sfcbd and root disk full I found this article:

ESXi 5.x host is disconnected from vCenter Server due to sfcbd exhausting inodes (2037798)

With the help of the command vdf -h I noticed that there was only 8 MB free on the root ramdisk:

Ramdisk       Size   Used  Available
root           32M    24M         8M
etc            28M   224K        27M
tmp           192M    40K       191M
hostdstats    834M     3M       830M

Mmm that’s not much.

To free up inodes:
  1. Connect to the ESXi host using SSH.
  2. To stop the sfcbd service, run the command:
    /etc/init.d/sfcbd-watchdog stop
  3. Manually delete the files in the /var/run/sfcb directory to free inodes. To remove the files, run the commands:
    cd /var/run/sfcb
    rm [0-2]*
    rm [3-6]*
    rm [7-9]*
    rm [a-c]*
    rm [d-f]*
  4. To restart the sfcbd service, run the command:
    /etc/init.d/sfcbd-watchdog start
    Note: You may need to restart management agents after step 4. For more information, see Restarting the Management agents on an ESXi or ESX host (1003490).

After doing this I had about 30 MB free on the root ramdisk.

This issue should be resolved in a later update. Unfortunately, it had not yet been possible for us to update these hosts with the latest patches.

To manually work around this issue, you can prevent the root file system from being impacted:

  1. Connect to the ESXi host using SSH.
  2. Run this command to stop the sfcbd service:
    /etc/init.d/sfcbd-watchdog stop
  3. Enable this command to be run at boot time:
    esxcli system visorfs ramdisk add --name sfcbtickets --min-size 0 --max-size 1024 --permissions 0755 --target /var/run/sfcb
    For more information, see Modifying the rc.local or local.sh file in ESX/ESXi to execute commands while booting (2043564).
  4. Run this command to start the sfcbd service:
    /etc/init.d/sfcbd-watchdog start
  5. Run this command and verify if the new ramdisk exists:
    esxcli system visorfs ramdisk list

Note: This issue may also occur if the SNMP agent is enabled. For more information, see vCenter Server and vmkernel logs report the message: The ramdisk ‘root’ is full (2033073).

After doing this I was able to successfully start up the virtual machine.
Now it’s time to make sure we do the updates ASAP!
Another handy article to gather file system information

vCenter 5.1 SSO upgrade to update 1

The upgrade from vCenter 5.1 to vCenter 5.1 Update 1 went fine (at least the installer did). But when trying to log on to vCenter half an hour later, I noticed it was only possible to log in with a local account and not with a “domain” account.

While searching through the SSO logs I saw some strange things like:

java.net.ConnectException: Connection timed out: connect

Troubleshooting VMware Single Sign-On configuration and installation issues in a Windows server (2033880)

When logging in to the Web Client and looking at the SSO settings, all the domains were tested successfully. So there is a connection, but I guess something during the upgrade, or a change in the domains, caused it to fail.

I used the command below to rediscover the domains, and 2 new domain resources were added. Unfortunately it still did not work properly, and it didn’t change the default domains.
So I removed all the domains listed in SSO and did another rediscover.

C:\Program Files\VMware\Infrastructure\SSOServer\utils>ssocli.cmd configure-riat -a discover-is -u admin -p masterPassword

Now I noticed that the log files changed and a lot of other information came through the logs.
I normally use Baretail to follow tails in Windows log files.

When I saw a lot of “Success” logins in the logfiles I had a good feeling it was working again.

Testing…

Login works fine now !

After 24 hours it seemed to be failing again. I removed everything again and waited a few minutes to be sure the DB had time to clean up. Then I re-added the domains; according to the log files everything should have been working again. But when I tried to log in, I received an error message that I didn’t have any authorization. When I logged in locally I noticed the permissions were missing, so I needed to re-add them to the folders etc.

So be warned that permissions can be removed when you wait too long!

Resources:
Logging in to vSphere Client 5.1 fails with the error: The server took too long to respond (2038918)
Updating the vCenter Single Sign On server database configuration (2045528)

Snapshots and a twoGbMaxExtentSparse VMDK

While converting some machines from KVM to ESX, we suddenly noticed that a scheduled Veeam backup job killed the VM and prevented it from starting. We got messages like the ones below. Strangely, we have a lot of other Linux machines where everything, including backups, works flawlessly.

Power On virtual machine
LINUX
File [Datastore]
LINUX/LINUX_rootvg-000001.vmdk
was not found
View details...

Consolidate virtual machine disk files
LINUX
File [] was not found

The strange thing I noticed when I logged in with SSH was a lot of -s001.vmdk disks. Mmm… that reminds me of another post I made earlier (see the last part of that post). I didn’t write about the details there, but I noticed that the imported machines were in twoGbMaxExtentSparse format.

With a storage vMotion or a vmkfstools inflate the disk is converted to a normal single VMDK.

Mmm… could this cause the snapshot to fail and make the VM think its disk is lost?

What I did is remove the disks from the VM, re-added them to the VM and directly after that svMotioned them to another datastore so the disk will be inflated. After that the machine starts flawlessly. Now I will need to do a little research to see if the snapshot operation caused the VM to crash.

It’s also possible to use the vmkfstools -i to inflate the disk of course.

Extra resources:

Recreating a missing virtual machine disk (VMDK) descriptor file (1002511)
Recreating a missing virtual disk (VMDK) descriptor file for disks split into 2GB files (1026266)
Cannot power on a virtual machine because the virtual disk cannot be opened (1004232)

 

 

 

PowerCLI function to grab WWPN’s

Why?

When speaking with our storage administrators, they sometimes ask for the exact WWPN so they can be sure to remove the right LUNs etc. Because it takes a few mouse clicks to find the WWPN and it’s hard to copy, I wrote the little PowerCLI function below.

Function


function Get-WWN {
    <#
    .SYNOPSIS
    Get WWN name from ESX hosts
    .DESCRIPTION
    This scripts gathers the WWN portname from ESX hosts
    .NOTES
    Authors: Patrick Heijmann
    .PARAMETER VMhosts
    Specify the VMhosts To gather the ports from
    .EXAMPLE
    PS> get-wwn -VMhosts ESXhost001
    .EXAMPLE
    PS> Get-Cluster -Name *|get-vmhost|Get-WWN
    .EXAMPLE
    PS> Get-vmhost *|Get-WWN
    #>
    Param (
        [Parameter(
            ValueFromPipeline = $true,
            ParameterSetName = "VMhosts",
            Mandatory = $true,
            HelpMessage = "Enter Host name")]
        [String[]]$VMhosts)

    process {
        foreach ($vmhost in $VMhosts) {
            Write-Host -ForegroundColor Green "Server: " $vmhost
            $hbas = Get-VMHostHba -VMHost $vmhost -Type FibreChannel
            foreach ($hba in $hbas) {
                $wwpn = "{0:x}" -f $hba.PortWorldWideName
                Write-Host -ForegroundColor Green `t "World Wide Port Name:" $wwpn
            }
        }
    }
}