vSphere 6.5 Update 1 is out, here’s why you want to upgrade
Hi
VMware have just released the first major update to vSphere 6.5. Normally I don't blog about these, but this update is so big and it fixes some really annoying bugs I saw while using the GA version of vSphere 6.5. Thankfully, we worked hard with their support to overcome some of the issues I have highlighted below, which was of course done for the greater good.
The release notes for ESXi 6.5 U1 can be seen here https://docs.vmware.com/en/VMware-vSphere/6.5/rn/vsphere-esxi-651-release-notes.html and it can be downloaded from here https://my.vmware.com/web/vmware/details?downloadGroup=ESXI65U1&productId=614&rPId=17343
The release notes for vCenter 6.5 U1 can be seen here https://docs.vmware.com/en/VMware-vSphere/6.5/rn/vsphere-vcenter-server-651-release-notes.html and it can be downloaded from here https://my.vmware.com/web/vmware/details?downloadGroup=VC65U1&productId=614&rPId=17343
Below is a partial list of the fixes that were close to my heart.
Storage Issues
To define the storage I/O scheduling policy for a virtual machine, you can configure the I/O throughput for each virtual machine disk by modifying its IOPS limit. When you edit the IOPS limit while CBT is enabled for the virtual machine, the operation fails with the error "The scheduling parameter change failed". Because of this problem, the scheduling policies of the virtual machine cannot be altered. The error message appears in the vSphere Recent Tasks pane.
You can see the following errors in the /var/log/vmkernel.log file:
2016-11-30T21:01:56.788Z cpu0:136101)VSCSI: 273: handle 8194(vscsi0:0):Input values: res=0 limit=-2 bw=-1 Shares=1000
2016-11-30T21:01:56.788Z cpu0:136101)ScsiSched: 2760: Invalid Bandwidth Cap Configuration
2016-11-30T21:01:56.788Z cpu0:136101)WARNING: VSCSI: 337: handle 8194(vscsi0:0):Failed to invert policy
This issue is resolved in this release.
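For reference, the per-disk IOPS limit this fix is about is the storageIOAllocation.limit property on the virtual disk, and editing it is just a VM reconfigure task. Below is a minimal pyVmomi sketch of that edit; the vCenter address, credentials and VM name are placeholders I made up for illustration, not anything taken from the release notes.

# Minimal pyVmomi sketch: set the IOPS limit on a VM's first virtual disk.
# The vCenter host, credentials and VM name are placeholders.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import ssl

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "test-vm")  # placeholder VM name

# Find the first virtual disk and edit its storage I/O allocation (IOPS limit).
disk = next(d for d in vm.config.hardware.device
            if isinstance(d, vim.vm.device.VirtualDisk))
disk.storageIOAllocation.limit = 500  # new IOPS limit

spec = vim.vm.ConfigSpec(deviceChange=[
    vim.vm.device.VirtualDeviceSpec(
        operation=vim.vm.device.VirtualDeviceSpec.Operation.edit,
        device=disk)])
# This reconfigure is the operation that used to fail with
# "The scheduling parameter change failed" when CBT was enabled.
vm.ReconfigVM_Task(spec=spec)
Disconnect(si)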
When you hot-add two or more hard disks to a VMware PVSCSI controller in a single operation, the guest OS can see only one of them.
This issue is resolved in this release.
An ESXi host might fail with a purple screen because of a race condition when multiple multipathing plugins (MPPs) try to claim paths.
This issue is resolved in this release.
If a VVol VASA Provider returns an error during a storage profile change operation, vSphere tries to undo the operation, but the profile ID gets corrupted in the process.
This issue is resolved in this release.
Per host Read or Write latency displayed for VVol datastores in the vSphere Web Client is incorrect.
This issue is resolved in this release.
The NFS v3 client does not properly handle a case where an NFS server returns an invalid file type as part of the file attributes, which causes the ESXi host to fail with a purple screen.
This issue is resolved in this release.
The lsi_mr3 driver allocates memory from the address space below 4 GB. The vSAN disk serviceability plugin lsu-lsi-lsi-mr3-plugin and the lsi_mr3 driver communicate with each other. The driver might stop responding during this memory allocation when handling an IOCTL event from storelib. As a result, lsu-lsi-lsi-mr3-plugin might stop responding, and the hostd process might also fail even after hostd is restarted.
This issue is resolved in this release with a code change in the lsu-lsi-lsi-mr3-plugin plugin of the lsi_mr3 driver, which sets a 3-second timeout when getting the device information to avoid plugin and hostd failures.
When you hot-add an existing or new virtual disk to a CBT-enabled VM residing on a VVol datastore, the guest operating system might stop responding until the hot-add process completes. How long the VM is unresponsive depends on the size of the virtual disk being added. The VM automatically recovers once the hot-add completes.
This issue is resolved in this release.
When you use vSphere Storage vMotion on vSphere Virtual Volumes storage, the UUID of a virtual disk might change. The UUID identifies the virtual disk and a changed UUID makes the virtual disk appear as a new and different disk. The UUID is also visible to the guest OS and might cause drives to be misidentified.
This issue is resolved in this release.
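If you want to confirm that the UUID really stays the same across a Storage vMotion, the vSphere API exposes it on the disk backing, so you can print it before and after the migration and compare. A short sketch, reusing the hypothetical si connection and vm object from the earlier example:

# Print the UUID of each flat-backed virtual disk on the VM; run this before and
# after the Storage vMotion and compare. "si" and "vm" are the placeholders
# from the earlier sketch.
from pyVmomi import vim

for dev in vm.config.hardware.device:
    if isinstance(dev, vim.vm.device.VirtualDisk) and \
       isinstance(dev.backing, vim.vm.device.VirtualDisk.FlatVer2BackingInfo):
        print(dev.deviceInfo.label, dev.backing.uuid)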
An ESXi host might stop responding if LUNs are unmapped on the storage array side while they are connected to the ESXi host through a Broadcom/Emulex Fibre Channel adapter (the lpfc driver) and have I/O running.
This issue is resolved in this release.
When a VMFS-6 volume is opened, a journal block is allocated. Upon successful allocation, a background thread is started. If there is no space on the volume for the journal, the volume is opened in read-only mode and no background thread is initiated. Any attempt to close the volume then results in attempts to wake up a nonexistent thread, which causes the ESXi host to fail.
This issue is resolved in this release.
When virtual machines use the SPC-4 Get LBA Status command to query the thin-provisioning status of large vRDMs attached to them, processing of this command might run for a long time in the ESXi kernel without relinquishing the CPU. The high CPU usage can cause the CPU heartbeat watchdog to deem the process hung, and the ESXi host might stop responding.
This issue is resolved in this release.
A VMDK file might reside on a VMFS6 datastore that is mounted on multiple ESXi hosts (for example, two hosts, ESXi host1 and ESXi host2). If the VMFS6 datastore capacity is increased from ESXi host1 while the datastore is also mounted on ESXi host2, and disk.vmdk has file blocks allocated from the increased portion of the datastore on ESXi host1, then accessing the disk.vmdk file from ESXi host2, with file blocks being allocated to it from ESXi host2, might cause ESXi host2 to fail with a purple screen.
This issue is resolved in this release.
If the paths to a LUN have different LUN IDs, as can happen with multipathing, the LUN is not registered by PSA and end users will not see it.
This issue is resolved in this release.
The recompose operation in Horizon View might fail for desktop virtual machines residing on NFS datastores with stale NFS file handle errors, because of the way virtual disk descriptors are written to NFS datastores.
This issue is resolved in this release.
An ESXi host might fail with a purple screen because of a CPU heartbeat failure, but only if SEsparse is used for creating snapshots and clones of virtual machines. The use of SEsparse might lead to CPU lockups, with the following warning message in the VMkernel logs, followed by a purple screen:
PCPU <cpu-num> didn’t have a heartbeat for <seconds> seconds; *may* be locked up.
This issue is resolved in this release.
Frequent lookups of a vSAN metadata directory (.upit) on Virtual Volumes datastores can impact their performance. The .upit directory is not applicable to Virtual Volumes datastores, so the change disables the lookup of the .upit directory.
This issue is resolved in this release.
Performance issues might occur when unaligned unmap requests are received from the guest OS under certain conditions. Depending on the size and number of the unaligned unmaps, this might happen when a large number of small files (less than 1 MB in size) are deleted from the guest OS.
This issue is resolved in this release.
ESXi 5.5 and 6.x hosts stop responding after running for 85 days. In the /var/log/vmkernel log file you see entries similar to:
YYYY-MM-DDTHH:MM:SS.833Z cpu58:34255)qlnativefc: vmhba2(5:0.0): Recieved a PUREX IOCB woh oo
YYYY-MM-DDTHH:MM:SS.833Z cpu58:34255)qlnativefc: vmhba2(5:0.0): Recieved the PUREX IOCB.
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674)qlnativefc: vmhba2(5:0.0): sizeof(struct rdp_rsp_payload) = 0x88
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674qlnativefc: vmhba2(5:0.0): transceiver_codes[0] = 0x3
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674)qlnativefc: vmhba2(5:0.0): transceiver_codes[0,1] = 0x3, 0x40
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674)qlnativefc: vmhba2(5:0.0): Stats Mailbox successful.
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674)qlnativefc: vmhba2(5:0.0): Sending the Response to the RDP packet
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674 0 1 2 3 4 5 6 7 8 9 Ah Bh Ch Dh Eh Fh
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674)————————————————————–
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 53 01 00 00 00 00 00 00 00 00 04 00 01 00 00 10
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) c0 1d 13 00 00 00 18 00 01 fc ff 00 00 00 00 20
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 00 00 00 00 88 00 00 00 b0 d6 97 3c 01 00 00 00
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 0 1 2 3 4 5 6 7 8 9 Ah Bh Ch Dh Eh Fh
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674)————————————————————–
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 02 00 00 00 00 00 00 80 00 00 00 01 00 00 00 04
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 18 00 00 00 00 01 00 00 00 00 00 0c 1e 94 86 08
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 0e 81 13 ec 0e 81 00 51 00 01 00 01 00 00 00 04
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 2c 00 04 00 00 01 00 02 00 00 00 1c 00 00 00 01
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 00 00 00 00 40 00 00 00 00 01 00 03 00 00 00 10
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674)50 01 43 80 23 18 a8 89 50 01 43 80 23 18 a8 88
YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 00 01 00 03 00 00 00 10 10 00 50 eb 1a da a1 8f
This is a firmware problem and it is triggered when Read Diagnostic Parameters (RDP) exchanges between the Fibre Channel (FC) switch and the host bus adapter (HBA) fail 2048 times. The HBA then stops responding, and because of this the virtual machine and/or the ESXi host might fail. By default, the RDP routine is initiated by the FC switch and runs once every hour, so the 2048 limit is reached in approximately 85 days (2048 hours / 24 ≈ 85 days).
This issue is resolved in this release.
Some Intel devices, for example the P3700 and P3600, have a vendor-specific limitation in their firmware or hardware. Because of this limitation, any I/O delivered to the NVMe device that crosses the stripe size (or boundary) can suffer a significant performance drop. The driver resolves this by checking every I/O and splitting the command if it crosses a stripe boundary on the device.
This issue is resolved in this release.
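To illustrate what splitting a command at the stripe boundary means, here is a small, driver-agnostic Python sketch; the stripe size and the I/O offsets are arbitrary example values, not anything specific to the Intel firmware.

# Illustrative only: split an I/O (offset, length) so that no resulting chunk
# crosses a stripe boundary. The values below are arbitrary examples.
def split_at_stripe_boundaries(offset, length, stripe_size):
    chunks = []
    while length > 0:
        # Bytes left until the next stripe boundary from the current offset.
        to_boundary = stripe_size - (offset % stripe_size)
        chunk = min(length, to_boundary)
        chunks.append((offset, chunk))
        offset += chunk
        length -= chunk
    return chunks

# A 256 KB write that starts 8 KB before a 128 KB stripe boundary becomes 3 chunks:
# [(122880, 8192), (131072, 131072), (262144, 122880)]
print(split_at_stripe_boundaries(offset=120 * 1024, length=256 * 1024,
                                 stripe_size=128 * 1024))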
When the controller starts, the driver might reset it twice (disable, enable, disable, and then finally enable it). This was a workaround for an early version of the QEMU emulator, but it might delay the appearance of some controllers. According to the NVMe specification, only one reset is needed, that is, disabling and then enabling the controller. This update removes the redundant controller reset during controller start.
This issue is resolved in this release.
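For context, the single reset the NVMe specification describes is just one disable/enable cycle of the controller, observed through the CC.EN and CSTS.RDY bits. A rough sketch of that cycle, where read_reg32/write_reg32 are hypothetical MMIO helpers I use purely for illustration:

# Rough sketch of one NVMe controller reset cycle per the specification:
# clear CC.EN, wait for CSTS.RDY to drop, set CC.EN, wait for CSTS.RDY to rise.
# read_reg32/write_reg32 are hypothetical MMIO helpers, not a real driver API.
REG_CC, REG_CSTS = 0x14, 0x1C  # controller configuration / controller status
CC_EN, CSTS_RDY = 0x1, 0x1     # enable bit / ready bit

def reset_controller(read_reg32, write_reg32):
    cc = read_reg32(REG_CC)
    write_reg32(REG_CC, cc & ~CC_EN)               # disable the controller
    while read_reg32(REG_CSTS) & CSTS_RDY:         # wait until RDY clears
        pass
    write_reg32(REG_CC, cc | CC_EN)                # enable the controller
    while not (read_reg32(REG_CSTS) & CSTS_RDY):   # wait until RDY is set
        pass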
An ESXi host might stop responding and fail with a purple screen, with entries similar to the following, as a result of a CPU lockup:
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]@BlueScreen: PCPU x: no heartbeat (x/x IPIs received)
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]Code start: 0xxxxx VMK uptime: x:xx:xx:xx.xxx
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]Saved backtrace from: pcpu x Heartbeat NMI
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]MCSLockWithFlagsWork@vmkernel#nover+0xx stack: 0xx
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]PB3_Read@esx#nover+0xx stack: 0xx
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]PB3_AccessPBVMFS5@esx#nover+00xx stack: 0xx
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]Fil3FileOffsetToBlockAddrCommonVMFS5@esx#nover+0xx stack:0xx
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]Fil3_ResolveFileOffsetAndGetBlockTypeVMFS5@esx#nover+0xx stack:0xx
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]Fil3_GetExtentDescriptorVMFS5@esx#nover+0xx stack: 0xx
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]Fil3_ScanExtentsBounded@esx#nover+0xx stack:0xx
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]Fil3GetFileMappingAndLabelInt@esx#nover+0xx stack: 0xx
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]Fil3_FileIoctl@esx#nover+0xx stack: 0xx
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]FSSVec_Ioctl@vmkernel#nover+0xx stack: 0xx
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]FSS_IoctlByFH@vmkernel#nover+0xx stack: 0xx
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]VSCSIFsEmulateCommand@vmkernel#nover+0xx stack: 0x0
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]VSCSI_FSCommand@vmkernel#nover+0xx stack: 0x1
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]VSCSI_IssueCommandBE@vmkernel#nover+0xx stack: 0xx
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]VSCSIExecuteCommandInt@vmkernel#nover+0xx stack: 0xb298e000
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]PVSCSIVmkProcessCmd@vmkernel#nover+0xx stack: 0xx
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]PVSCSIVmkProcessRequestRing@vmkernel#nover+0xx stack: 0xx
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]PVSCSI_ProcessRing@vmkernel#nover+0xx stack: 0xx
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]VMMVMKCall_Call@vmkernel#nover+0xx stack: 0xx
0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]VMKVMM_ArchEnterVMKernel@vmkernel#nover+0xe stack: 0x0
This occurs if your virtual machine's hardware version is 13 and it uses the SPC-4 feature for a large virtual disk.
This issue is resolved in this release.
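If you want to check whether a virtual machine is on hardware version 13 and therefore potentially exposed to the issue above, the hardware version is available as config.version in the vSphere API. A tiny sketch, again reusing the hypothetical si connection from the first example:

# List VMs whose hardware version is 13 ("vmx-13"), the version called out in
# the issue above. "si" is the placeholder connection from the first sketch.
from pyVmomi import vim

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.VirtualMachine], True)
for v in view.view:
    if v.config and v.config.version == "vmx-13":
        print(v.name, v.config.version)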
According to the kernel log, an ATAPI device is exposed on one of the AHCI ports of the Marvell 9230 controller. This Marvell Console device is an interface for configuring RAID on the Marvell 9230 AHCI controller and is used by some Marvell CLI tools.
In the output of the esxcfg-scsidevs -l command, a host equipped with the Marvell 9230 controller does not show the SCSI device with the Local Marvell Processor display name.
The information in the kernel log is:
WARNING: vmw_ahci[XXXXXXXX]: scsiDiscover:the ATAPI device is not CD/DVD device
This issue is resolved in this release.
Depending on the workload and the number of virtual machines, diskgroups on the host might go into permanent device loss (PDL) state. This causes the diskgroups to not admit further IOs, rendering them unusable until manual intervention is performed.
This issue is resolved in this release.
The ESXi functionality that allows unaligned unmap requests did not account for the fact that the unmap request may occur in a non-blocking context. If the unmap request is unaligned, and the requesting context is non-blocking, it could result in a purple screen. Common unaligned unmap requests in non-blocking context typically occur in HBR environments.
This issue is resolved in this release.
Due to a memory leak in the LVM module, the LVM driver might run out of memory under certain conditions, causing the ESXi host to lose access to the VMFS datastore.
This issue is resolved in this release.