
Windows 2003 : DNS Scavenging

Configuring DNS Scavenging

While trawling through one of our reverse DNS zones I noticed several duplicate RR entries for DHCP IP addresses, most of which had a time stamp that was several weeks old. In our environment we use DHCP DNS Dynamic Updates for client registration.

At the same time we noticed that McAfee EPO was reporting strange client names and UNIX systems that perform reverse DNS when using SSH would report the incorrect FQDN for remote connections.

 
Scavenging Options

To resolve all of the above symptoms we needed to implement DNS Scavenging. The internal DNS infrastructure runs from AD-Integrated zones on Windows 2003 R2 x64 Domain Controllers.

DNS Scavenging Terminology

    * No-Refresh Interval: the period after a record's time stamp is set during which the time stamp cannot be refreshed.
    * Refresh Interval: the period following the No-Refresh Interval during which the time stamp can be refreshed; if it is not, the record can be deleted.

 The sum of these two intervals should equal the DHCP scope lease time, as in the examples below:

 

For example, on an environment with a DHCP lease time of 3 days:
    * No-Refresh Interval: 1 Day
    * Refresh Interval: 2 Days

For a default DHCP environment with a lease time of 8 days:
    * No-Refresh Interval: 3 Days
    * Refresh Interval: 5 Days
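
The rule above (No-Refresh plus Refresh should equal the DHCP lease time) can be checked with a few lines of Python. This is only a sketch: the function names are my own, and the conversion to hours is included purely for convenience.

```python
def intervals_match_lease(no_refresh_days, refresh_days, lease_days):
    """Return True when No-Refresh + Refresh equals the DHCP lease time."""
    return no_refresh_days + refresh_days == lease_days


def days_to_hours(days):
    """Convert an interval in days to hours."""
    return days * 24


# The two examples from the article: (no-refresh, refresh, lease) in days.
for no_refresh, refresh, lease in [(1, 2, 3), (3, 5, 8)]:
    ok = intervals_match_lease(no_refresh, refresh, lease)
    print(f"lease={lease}d no-refresh={days_to_hours(no_refresh)}h "
          f"refresh={days_to_hours(refresh)}h match={ok}")
```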

Implementation of DNS Scavenging

Scavenging must be enabled at the server level and zone level.

Scavenging should only be enabled on a single DNS server within your environment. This makes troubleshooting much simpler in the event of scavenging failing, and it also makes configuration far simpler.

On the server I have configured the following settings:

On the zone the following settings are required (zone level settings override server level settings):

 

 


Cisco : CCNA Wireless Cram Sheet

CCNA Wireless Cram Sheet

 

Types of WLAN technology

 

Narrowband (unlicensed bands)

·         900 MHz – used by old cordless phones

·         2.4 GHz – used by cordless phones, WLAN, Bluetooth and microwaves

·         5 GHz – used by WLAN, new cordless phones

·         Uses spread spectrum – signalling over multiple frequencies.

·         Limited range

 

Broadband

·         Lower bandwidth than narrow band

·         Wider coverage.

·         Personal Communication Services (PCS) – Sprint PCS is an example supplier of this technology.

 

Circuit and Packet Data

·         Lower data rate than both of the above.

·         Wider coverage (national).

·         High fee per megabit – although flat-rate contracts are common nowadays

·         3G is an example of this technology.

 

Carrier Sense Multiple Access/Collision Avoidance (CSMA/CA)

WLAN devices cannot send and receive at the same time. Devices use RTS (request-to-send) and CTS (clear-to-send) signals.

 

Wireless APs are similar in function to Ethernet hubs: each AP has a finite bandwidth, so the more devices attached to the AP, the less bandwidth each device has available to it.
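
As a rough illustration of the shared-bandwidth point above, here is a small Python sketch. It assumes an even split between devices and ignores protocol overhead and contention, so treat the result as an upper bound; the function name is hypothetical.

```python
def per_device_bandwidth_mbps(ap_bandwidth_mbps, device_count):
    """Even share of an AP's bandwidth per attached device (upper bound)."""
    if device_count < 1:
        raise ValueError("device_count must be at least 1")
    return ap_bandwidth_mbps / device_count


# A 54 Mbps AP shared by an increasing number of devices.
for devices in (1, 6, 18):
    share = per_device_bandwidth_mbps(54, devices)
    print(f"{devices} devices: {share:.1f} Mbps each")
```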

 

Signal Strength Issues

·         Absorption – walls, ceilings and floors absorb signals.

·         Scattering – rough walls and carpets scatter signals.

·         Reflection – metal and glass reflect signals.

·         (Interference – Microwaves, rogue APs and cordless phones can interfere)

 

Standards Bodies

·         FCC – Federal Communications Commission

·         ETSI – European Telecommunications Standards Institute

·         ITU-R – International Telecommunication Union Radiocommunication Sector

·         IEEE – Institute of Electrical and Electronics Engineers – defines how WLAN is implemented in the 802.11 standards

·         WiFi Alliance – Cisco is a founding member of this organisation. Ensures interoperability between manufacturers.

 

Wireless Standards

 

Standard    Frequency    Max data rate    Technology
802.11a     5 GHz        54 Mbps          OFDM
802.11b     2.4 GHz      11 Mbps          DSSS
802.11g     2.4 GHz      54 Mbps          DSSS/OFDM
802.11n     2.4/5 GHz    248 Mbps         MIMO

 

OFDM – Orthogonal Frequency Division Multiplexing; splits the signal across multiple closely spaced sub-carriers.

DSSS – Direct-Sequence Spread Spectrum; one channel is used to send data across all frequencies in the channel.

MIMO – Multiple Input Multiple Output; uses DSSS and OFDM across 14 overlapping channels at 5MHz intervals.
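
The 14 overlapping channels at 5MHz intervals mentioned above are those of the 2.4GHz band. The sketch below works out the channel centre frequencies and shows why channels 1, 6 and 11 are usually picked. It assumes the usual 22MHz DSSS channel width, which is not stated in the notes above, and the function names are my own.

```python
def channel_centre_mhz(channel):
    """Centre frequency in MHz of a 2.4 GHz channel (1 to 14)."""
    if channel == 14:
        return 2484  # channel 14 sits outside the regular 5 MHz spacing
    if 1 <= channel <= 13:
        return 2407 + 5 * channel
    raise ValueError("2.4 GHz channels are numbered 1 to 14")


def channels_overlap(a, b, width_mhz=22):
    """True when two channels are closer together than one channel width.

    width_mhz=22 is an assumption (the DSSS channel width).
    """
    return abs(channel_centre_mhz(a) - channel_centre_mhz(b)) < width_mhz


print(channel_centre_mhz(1), channel_centre_mhz(6), channel_centre_mhz(11))
print(channels_overlap(1, 5), channels_overlap(1, 6))
```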

 

 

Compatibility

·         802.11b and 802.11g can interoperate; 802.11g is backwards compatible.

·         802.11a is not compatible with 802.11b or 802.11g.

·         802.11n is compatible with 802.11a, 802.11b and 802.11g, however it will be slower in interoperability mode. At the time of writing 802.11n had not been ratified, so there may be interoperability issues between vendor hardware. 802.11n requires multiple antennae for MIMO.

 

Security

 

Potential Threats

·         War Driving – a potential hacker uses a laptop to find a wireless network and tries to break in.

 

Connection Process

·         Service Set Identifier (SSID) used to identify network to clients, this is broadcast.

·         Client sends the AP its MAC address and the required security information

 

802.11 Defined Security

The 802.11 standard defines two security methods, both of which are weak by today’s standards:

·         Open Authentication (no security!)

·         Shared Key Authentication – static encryption using WEP

 

A well-secured WLAN has the following security configurations:

·         Encryption

·         Authentication

·         IPS

 

SSID Cloaking and MAC Address Filtering

 

SSID Cloaking – the administrator disables SSID broadcast. However, a client can send the AP a null string SSID value, so MAC address filtering was often enabled as well. Unfortunately it is also possible to spoof a MAC address.

 

Wired Equivalent Privacy (WEP)

 

Uses RC4 encryption with a static 64-bit key, which can easily be broken as only 40 bits are secret key material and 24 bits are a clear-text IV (Initialization Vector). It was later upgraded to 128-bit, but the IV was still sent in clear text, meaning it only took slightly longer (minutes) to break in.
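
The key-length arithmetic above can be spelled out in a short Python sketch. The names are my own; the only figures used are the 24-bit IV and the advertised key sizes from the paragraph above.

```python
IV_BITS = 24  # the clear-text Initialization Vector


def wep_secret_bits(advertised_bits):
    """Secret key bits left once the clear-text IV is subtracted."""
    return advertised_bits - IV_BITS


def iv_space():
    """Number of distinct IV values before they must repeat."""
    return 2 ** IV_BITS


print(wep_secret_bits(64), wep_secret_bits(128), iv_space())
```

Because the IV is only 24 bits long, values are reused quickly on a busy network, which is part of what makes the static key recoverable.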

 

TKIP (Temporal Key Integrity Protocol)

 

Initially Cisco hardware specific, it later became an open standard – beware that there is no interoperability between the original Cisco implementation and the now open TKIP. Per-packet keying and hashing using CMIC (Cisco Message Integrity Check) – each packet is digitally signed.

 

802.1x EAP

 

Extensible Authentication Protocol is a 2-layer process with 2 varieties:

·         EAP (WLAN)

·         EAPoLAN

 

EAP defines a standard way of encapsulating authentication information such as certificates/usernames/passwords that an AP can use for authentication.

 

EAP is an extension of PPP and has several extensions:

·         EAP-MD5 – CHAP authentication with static password

·         EAP-TLS – X.509v3 certificates

·         LEAP – Lightweight EAP, password and per-session WEP keys

·         PEAP – Protected EAP; SSL/TLS secures communications, with MS-CHAP or a One Time Password (OTP) used to authenticate the username and password. Digital certificate required on server.

·         EAP-FAST – Shared secret key used to encrypt authentication information.

·         EAP-GTC – authentication by Generic Token Card.

 

802.1x and RADIUS define how to packetize the EAP information and move it across the network. In the RADIUS model:

·         Client is the Supplicant

·         AP is the Authenticator

·         RADIUS Server is the Authentication Server

 

WiFi Protected Access (WPA)

Designed by the WiFi Alliance as an interim wireless security solution until 802.11i (WPA2) was ratified.

Authentication handled by 802.1x and TKIP used with WEP. The TKIP flavour used by WPA is non-proprietary and is NOT compatible with the Cisco TKIP implementation.

 

Personal Mode – Pre-shared Key (PSK) used to authenticate, key stored on client and server – designed for SOHO use.

 

Enterprise Mode – allows for large organisations to have a centralised credential server. Uses 802.1x for authentication.

 

WPA2 (802.11i)

 

Doesn’t use WEP; instead uses AES (Advanced Encryption Standard) with the Counter Mode CBC-MAC Protocol (CCMP).

 

AES-CCMP incorporates AES 128-bit encryption with 2 cryptographic technologies:

·         Counter mode makes eavesdropping more difficult by stopping patterns in WLAN traffic

·         CBC-MAC ensures frames have not been tampered with

 

WLAN Access Modes

 

Ad-Hoc (IBSS – Independent Basic Service Set) – peer-to-peer – presents security and scalability issues

Infrastructure (BSS – Basic Service Set or ESS – Extended Service Set) – via an AP

 

Infrastructure modes:

·         BSS – Basic Service Set – provides per-device BSSID. Used for non-roaming devices.

·         ESS – Extended Service Set – provides a single SSID for all devices. Each AP still has its own BSSID.

 

Coverage:

·         BSA – Basic Service Area – single AP (cell)

·         ESA – Extended Service Area – multiple APs (cells) on different channels, but in the same frequency band (i.e. 2.4GHz/5GHz). On non-VoIP networks cell overlap should be 10-15%; on VoIP networks it should be 15-20%.

 

An AP is a layer 2 device (a bridge), so in larger organisations ‘IP helper’/DHCP forwarding may be required on the router or layer 3 switch that serves the WLAN subnet.

 

Configuring APs/Troubleshooting WLAN

 

Cisco recommends using the SDM (Security Device Manager) to configure APs.

 

Common troubleshooting tasks:

·         Check signal strength, check device placement and either adjust aerial or replace it with a more powerful one

·         Check encryption settings: do the device and AP support the same encryption standards?

·         WLAN NIC firmware update may resolve connectivity issues.


Citrix : The RPC server cannot be contacted on server .


This issue has plagued me infrequently over the last 3 – 6 months: a Citrix server in a PS4.5 farm would suddenly be unable to use the Citrix Access Management Console because, when the discovery process was running, it would report:

 The RPC server cannot be contacted on server .

The solution for this is simple. If the IMA Service has been restarted or has terminated unexpectedly, the Citrix COM+ components will sometimes fail to refresh. To resolve this perform the following steps:

  • Terminate ‘ConfigMgrSvc.exe’ using Task Manager
  • Re-open the Citrix Access Management Console and Run Discovery.

 Further information is available here: http://support.citrix.com/article/CTX116752


Joomla : Improve Site Performance


1) Enable Joomla Site Caching

Open the Joomla Administrator, select the Site menu, then Global Configuration, then System. I’ve configured caching with a 15 minute cache life.

For sites with data which changes infrequently, use a longer cache time.

2) Enable Apache PHP/HTML compression

Edit the .htaccess file in the root of your Joomla website and modify the IfModule mod_php4.c section to include the line ‘php_flag zlib.output_compression on’ – for example:

<IfModule mod_php4.c>
php_value max_execution_time      120
php_flag zlib.output_compression on
</IfModule>
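
To get a feel for why output compression helps, the Python sketch below compresses some repetitive HTML with the standard zlib module. The markup is invented for illustration, and the saving on a real Joomla page will differ.

```python
import zlib

# Hypothetical, repetitive HTML similar to a list of article teasers.
html = ('<li class="item">Joomla article teaser</li>\n' * 200).encode("utf-8")

compressed = zlib.compress(html)
ratio = len(compressed) / len(html)

print(f"original: {len(html)} bytes")
print(f"compressed: {len(compressed)} bytes ({ratio:.1%} of original)")
```

HTML compresses well because tags and class names repeat, which is why enabling compression usually reduces transfer size noticeably at the cost of a little CPU time on the server.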


SQL : AUTO STATS Troubleshooting


Statistics for Query Optimization are objects that contain statistical information about the distribution of values in one or more columns of a table or indexed view. The query optimizer uses these statistics to estimate the cardinality, or number of rows, in the query result. These cardinality estimates enable the query optimizer to create a high-quality query plan.

To view the indexes and their last update time use the following command:

 SELECT
        o.name AS Table_Name
       ,i.name AS Index_Name
       ,STATS_DATE(o.id,i.indid) AS Date_Updated

FROM
        sysobjects o JOIN
        sysindexes i ON i.id = o.id

WHERE
        o.xtype = 'U' AND
        i.name IS NOT NULL

ORDER BY
        o.name ASC
       ,i.name ASC

To update statistics:

  •     On a single index: UPDATE STATISTICS [TABLE_NAME] [INDEX_NAME]
  •     On a single table: UPDATE STATISTICS [TABLE_NAME]
  •     On the entire Database: EXEC sp_updatestats

 

 


SQL : Common Wait Types

SQL : Common SQL Wait Types

Using the command DBCC SQLPERF(WAITSTATS) you can get real time totals of all wait types within an SQL instance. The table below outlines some common wait types and what to look for when troubleshooting them. 

NETWORKIO

Waiting on network I/O completion. Waiting to read or write to a client on the network. Check NIC saturation.

LCK_x

Check for memory pressure, which causes more physical I/O, thus prolonging the duration of transactions and locks. View the following counter: Lock Wait Time (ms)

CXPACKET

On OLTP (On-line Transaction Processing) systems, if CXPACKET waits account for more than 5% of total waits, CPU parallelism is likely to be causing performance degradation. On data warehouse systems, if CXPACKET waits are 10% or greater, it is likely that parallelism is the cause of the performance issues.

 

Look to reduce parallelism; on OLTP systems best practice is to disable it completely.
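
The CXPACKET thresholds above can be applied to wait figures with a small Python sketch. It assumes the percentage is measured against total wait time (it could equally be taken against request counts), and the sample numbers and function names are invented for illustration.

```python
def wait_share_percent(wait_times_ms, wait_type):
    """Percentage of total wait time attributed to one wait type."""
    total = sum(wait_times_ms.values())
    if total == 0:
        return 0.0
    return 100.0 * wait_times_ms.get(wait_type, 0) / total


def parallelism_suspected(wait_times_ms, workload="oltp"):
    """Apply the thresholds above: over 5% for OLTP, 10% or more otherwise."""
    share = wait_share_percent(wait_times_ms, "CXPACKET")
    if workload == "oltp":
        return share > 5.0
    return share >= 10.0


# Hypothetical wait times in milliseconds.
sample = {"CXPACKET": 600, "NETWORKIO": 3000, "PAGEIOLATCH_SH": 6400}
print(wait_share_percent(sample, "CXPACKET"))
print(parallelism_suspected(sample, "oltp"), parallelism_suspected(sample, "dw"))
```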

IO_COMPLETION

Indicates an I/O bottleneck. Identify disk bottlenecks, using IoStallMS values and Perfmon counters:

  • PhysicalDisk: Disk Sec/read, Disk Sec/write, Disk Queues
  • SQL Server Buffer Manager: Page Life Expectancy, Checkpoint Pages/sec, Lazy Writes/sec
  • SQL Server Access Methods for correct indexing: Full Scans/sec, Index Searches/sec
  • Memory: Page Faults/sec

 

Identify IoStallMS values using the following query:

/* Find DB ID */

SELECT DB_ID('SQLTraceDB') AS [Database ID]

/* Display IoStallMS (replace 7 with the database ID returned above; -1 returns all files) */

SELECT * FROM ::fn_virtualfilestats(7,-1)

EXCHANGE

Check for parallelism using sp_configure 'max degree of parallelism'. Disable parallelism by setting it to 1, or change it to half the number of CPUs.

ASYNC_IO_COMPLETION

Waiting for asynchronous I/O requests to complete. Identify disk bottlenecks, using IoStallMS values and Perfmon counters:

  • PhysicalDisk: Disk Sec/read, Disk Sec/write, Disk Queues
  • SQL Server Buffer Manager: Page Life Expectancy, Checkpoint Pages/sec, Lazy Writes/sec
  • SQL Server Access Methods for correct indexing: Full Scans/sec, Index Searches/sec
  • Memory: Page Faults/sec

 

Identify IoStallMS values using the following query:

/* Find DB ID */

SELECT DB_ID('SQLTraceDB') AS [Database ID]

/* Display IoStallMS (replace 7 with the database ID returned above; -1 returns all files) */

SELECT * FROM ::fn_virtualfilestats(7,-1)

 

Full Scans/Sec
This counter should always be captured. It shows how often a table index is not being used, which results in sequential I/O. It is defined as the number of unrestricted full scans, which can be either base table or full index scans. Missing or incorrect indexes can reduce performance because of excessive disk access.

 

Page Life Expectancy

According to Microsoft, 300 seconds is the minimum target for page life expectancy. If the buffer pool flushes your pages in less than 300 seconds, you probably have a memory problem.

 

WAITFOR

Check common SQL Stored Procedures for WAITFOR DELAY statements. This is a fixed delay within a stored procedure.

PAGELATCH_x

Usually indicates cache contention issues

PAGEIOLATCH_x

Usually indicates I/O issues. Check disk subsystem.

OLEDB

See Disk secs/Read and Disk secs/Write. If Disk secs/Read is high, add additional I/O bandwidth, balance I/O across other drives.

 

To get the Transact-SQL statement involved in OLEDB waits:

DECLARE @Handle binary(20)

SELECT @Handle = sql_handle FROM sysprocesses

WHERE waittype = 0x0042

SELECT * FROM ::fn_get_sql(@Handle)

 


SQL : Troubleshooting Waits

SQL Troubleshooting Wait Time

The following SQL script will export the output of ‘DBCC SQLPERF(WAITSTATS)’ into a temporary table named #waitstats.

CREATE TABLE #waitstats
(
    [Wait Type]        nvarchar(32) not null,
    [Requests]         float not null,
    [Wait Time]        float not null,
    [Signal Wait Time] float not null
)

INSERT INTO #waitstats EXEC('dbcc sqlperf(waitstats)')

SELECT * FROM #waitstats
ORDER BY [Wait Time] DESC

To delete the temporary table use the command:

DROP TABLE #waitstats 


SQL : Troubleshooting Index Fragmentation

The following SQL scripts will identify fragmented indexes and automatically rebuild them. Note that they should not be run during busy periods/production hours and should only be run against databases which are not partitioned:

{code lang:sql showtitle:false lines:false hidden:false}declare @tablename varchar (128)
declare @id int
declare @Cnt smallint
declare @nrc_mask int
declare tables cursor for
select TABLE_SCHEMA + '.' + TABLE_NAME
from INFORMATION_SCHEMA.TABLES
where TABLE_TYPE = 'BASE TABLE'

create table #spt_space
( aidi int,
[rows] int null,
reserved dec(15) null,
AUTOSTATS char(3) null, -- wide enough to hold 'OFF'
[Last Updated] datetime
)
open tables
/*
** Set NORECOMPUTE mask
*/
set @nrc_mask = 16777216

fetch next
from tables
into @tablename
while @@fetch_status = 0
begin

select @id = object_id(@tablename)

insert into #spt_space (aidi, reserved)
select @id, sum(reserved)
from sysindexes
where indid in (0, 1, 255)
and id = @id
update #spt_space set [rows] = i.[rows]
from #spt_space inner join sysindexes i
on #spt_space.aidi = i.id
where i.indid < 2 and i.id = @id -- heap (0) or clustered index (1) row holds the row count

update #spt_space set [AUTOSTATS] =
case (si.status & @nrc_mask)
when @nrc_mask then 'OFF'
else 'ON'
end,
[Last Updated] = stats_date(#spt_space.aidi, si.indid)
from sysindexes si inner join #spt_space
on #spt_space.aidi = si.id
where si.id = @id AND -- Table
si.indid BETWEEN 1 AND 254 -- Skip HEAP/TEXT index
fetch next
from tables
into @tablename
end

-- Close and deallocate the cursor
close tables
deallocate tables

CREATE TABLE #fraglist (
ObjectName CHAR (255),
ObjectId INT,
IndexName CHAR (255),
IndexId INT,
Lvl INT,
CountPages INT,
CountRows INT,
MinRecSize INT,
MaxRecSize INT,
AvgRecSize INT,
ForRecCount INT,
Extents INT,
ExtentSwitches INT,
AvgFreeBytes INT,
AvgPageDensity INT,
ScanDensity DECIMAL,
BestCount INT,
ActualCount INT,
LogicalFrag DECIMAL,
ExtentFrag DECIMAL)

CREATE TABLE #bigtables (i int NOT NULL IDENTITY(1,1), tablename varchar(128))
INSERT #bigtables(tablename)
SELECT object_name(aidi) as tablename
FROM #spt_space
--WHERE rows >= 100000

SELECT * FROM #spt_space
DROP TABLE #spt_space
SELECT @Cnt = count(*) FROM #bigtables

DECLARE @DTime datetime
Set @DTime = GETDATE()

SET @id = 1
WHILE @id <= @Cnt
BEGIN
SELECT @tablename = tablename FROM #bigtables WHERE i = @id
INSERT INTO #fraglist
EXEC ('DBCC SHOWCONTIG (''' + @tablename + ''')
WITH TABLERESULTS, ALL_INDEXES, NO_INFOMSGS, FAST')
SET @id = @id + 1
END

SELECT ObjectName, IndexName, CountPages, ExtentSwitches, BestCount, ActualCount, LogicalFrag FROM #fraglist ORDER BY LogicalFrag DESC
DROP Table #bigtables
DROP table #fraglist{/code}

Use the following code to rebuild indexes that are 10% or more fragmented (you can change this threshold by modifying the value in the ‘WHERE LogicalFrag >= 10’ clause):

{code lang:sql title:”SQL Query – Fix Fragmentation” lines:false hidden:false}declare @tablename varchar (128) 

declare @id int 
declare @Cnt smallint 
declare @nrc_mask int 
declare tables cursor for 
select TABLE_SCHEMA + '.' + TABLE_NAME 
from INFORMATION_SCHEMA.TABLES 
where TABLE_TYPE = 'BASE TABLE' 

create table #spt_space 
( aidi int, 
[rows] int null, 
reserved dec(15) null, 
AUTOSTATS char(3) null, -- wide enough to hold 'OFF'
[Last Updated] datetime 
)
open tables 
/* 
** Set NORECOMPUTE mask 
*/ 
set @nrc_mask = 16777216 

fetch next 
from tables 
into @tablename 
while @@fetch_status = 0 
begin 

select @id = object_id(@tablename) 

insert into #spt_space (aidi, reserved) 
select @id, sum(reserved) 
from sysindexes 
where indid in (0, 1, 255) 
and id = @id 
update #spt_space set [rows] = i.[rows] 
from #spt_space inner join sysindexes i 
on #spt_space.aidi = i.id 
where i.indid < 2 and i.id = @id -- heap (0) or clustered index (1) row holds the row count

update #spt_space set [AUTOSTATS] = 
case (si.status & @nrc_mask) 
when @nrc_mask then 'OFF' 
else 'ON' 
end, 
[Last Updated] = stats_date(#spt_space.aidi, si.indid) 
from sysindexes si inner join #spt_space 
on #spt_space.aidi = si.id 
where si.id = @id AND -- Table 
si.indid BETWEEN 1 AND 254 -- Skip HEAP/TEXT index 
fetch next 
from tables 
into @tablename 
end 

-- Close and deallocate the cursor 
close tables 
deallocate tables 

CREATE TABLE #fraglist ( 
ObjectName CHAR (255), 
ObjectId INT, 
IndexName CHAR (255), 
IndexId INT, 
Lvl INT, 
CountPages INT, 
CountRows INT, 
MinRecSize INT, 
MaxRecSize INT, 
AvgRecSize INT, 
ForRecCount INT, 
Extents INT, 
ExtentSwitches INT, 
AvgFreeBytes INT, 
AvgPageDensity INT, 
ScanDensity DECIMAL, 
BestCount INT, 
ActualCount INT, 
LogicalFrag DECIMAL, 
ExtentFrag DECIMAL) 

CREATE TABLE #bigtables (i int NOT NULL IDENTITY(1,1), tablename varchar(128)) 
INSERT #bigtables(tablename) 
SELECT object_name(aidi) as tablename 
FROM #spt_space 
--WHERE rows >= 100000 

SELECT * FROM #spt_space 
DROP TABLE #spt_space 
SELECT @Cnt = count(*) FROM #bigtables 

DECLARE @DTime datetime 
Set @DTime = GETDATE() 

SET @id = 1 
WHILE @id <= @Cnt 
BEGIN 
SELECT @tablename = tablename FROM #bigtables WHERE i = @id 
INSERT INTO #fraglist 
EXEC ('DBCC SHOWCONTIG (''' + @tablename + ''') 
WITH TABLERESULTS, ALL_INDEXES, NO_INFOMSGS, FAST') 
SET @id = @id + 1 
END 

SELECT ObjectName, IndexName, CountPages, ExtentSwitches, BestCount, ActualCount, LogicalFrag FROM #fraglist ORDER BY LogicalFrag DESC 

DROP Table #bigtables 

/* ********************************* */
/* Perform the REINDEX – NOTE This will lock the Table in which the index resides */
/* ********************************* */

DECLARE @objectid int, @indexname varchar(200), @frag decimal, @SQL varchar(500)
DECLARE indexes CURSOR FOR
SELECT ObjectName, ObjectId, IndexName, LogicalFrag
FROM #fraglist
WHERE LogicalFrag >= 10
AND INDEXPROPERTY (ObjectId, IndexName, 'IndexDepth') > 0;
-- Open the cursor.
OPEN indexes;
-- Loop through the indexes.
FETCH NEXT
FROM indexes
INTO @tablename, @objectid, @indexname, @frag;
WHILE @@FETCH_STATUS = 0
BEGIN;
PRINT 'Executing DBCC DBREINDEX (' + RTRIM(@tablename) + ',''' + RTRIM(@indexname)
+ ''') - LogicalFrag Currently: ' + RTRIM(CONVERT(VARCHAR(15),@frag)) + '%';

SELECT @SQL = 'DBCC DBREINDEX (' + RTRIM(@tablename) + ',''' + RTRIM(@indexname) + ''')'
EXEC (@SQL)

FETCH NEXT
FROM indexes
INTO @tablename, @objectid, @indexname, @frag;
END;
-- Close and deallocate the cursor.
CLOSE indexes;
DEALLOCATE indexes;
-- Delete the temporary table.
DROP TABLE #fraglist;
GO

EXEC sp_updatestats

{/code} 
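
The selection step in the script above (pick indexes at or above the fragmentation threshold, then build one DBCC DBREINDEX statement per index) can be sketched outside SQL as follows. This Python version only builds the statement text and does not connect to a database; the table and index names are hypothetical.

```python
def reindex_statements(frag_rows, threshold=10):
    """Build DBCC DBREINDEX statements for indexes at or above the threshold.

    frag_rows is a list of (table_name, index_name, logical_frag_percent).
    """
    statements = []
    for table, index, logical_frag in frag_rows:
        if logical_frag >= threshold:
            statements.append(f"DBCC DBREINDEX ({table},'{index}')")
    return statements


# Hypothetical rows in the shape reported by the #fraglist query.
rows = [
    ("Orders", "IX_Orders_Date", 42),
    ("Customers", "PK_Customers", 3),
    ("Invoices", "IX_Invoices_Cust", 10),
]
for statement in reindex_statements(rows):
    print(statement)
```

Raising the threshold argument has the same effect as changing the LogicalFrag comparison in the script.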

 

Exchange 2007 : Error: The Exchange server address list service failed to respond


During a DR simulation I recently came across the following error on an Exchange 2007 CCR cluster:

Error: The Exchange server address list service failed to respond. This could be because of an address list or email address policy configuration error.

The issue was that the System Attendant service had lost its connection to the domain controller it was using. For some reason it did not automatically connect to another DC. All other Exchange functions were working, but management of users, groups and system objects was impossible.

To resolve this issue simply restart the Exchange System Attendant Instance service for the cluster using the Cluster Administrator tool.

This will allow the service to re-attach to an available Domain Controller.


VMWare : Troubleshooting VM Performance

VMWare Troubleshooting VM Performance in ESX/vSphere

Firstly we’re going to be using esxtop, which can be executed from the local console or an SSH session on ESXi, as well as vCenter.

General guidelines:

  1. Physical core : vCPU ratio should be around 1:5
  2. Avoid memory oversubscription in production environments
  3. Reserve memory for virtualised SQL servers using Lock Pages in Memory
  4. Reserve memory for virtualised RDS/Citrix Servers
  5. Always create datastores using the vSphere client; this ensures that the VMFS partition is aligned, which reduces IO and potentially increases performance.
  6. Align guest data disks (for Windows, any version earlier than 2008 must be manually aligned); this reduces IO and potentially increases performance.
  7. Look to keep the number of VMs per Datastore/LUN to around 10-15; this will help to reduce SCSI reservation contention.
  8. vCPUs: less is more. From my own testing I’ve found that Citrix servers perform best with 2 vCPUs rather than 4. This is better not only for my users but also for the ESXi hosts, as there is less co-scheduling.

Let’s begin...

1) Investigate CPU contention/exhaustion using esxtop (press ‘c’ from esxtop, and shift-v for per-VM stats only):

  1. Check host PCPU usage using esxtop
  2. Look at %RDY; if this is equal to or greater than 10% there is a performance issue, which can indicate CPU contention.
  3. Look at %MLMTD; if this is high it indicates that a CPU limit is being imposed on the VM. %RDY minus %MLMTD gives a true indication of CPU contention.
  4. If %RDY is truly above 10%, the first step is to lower the number of active vCPUs configured on the ESX/vSphere server; the next is to reduce the number of VMs on the server.
  5. Investigate co-scheduling/SMP related issues: are VMs using all presented vCPUs? From esxtop press ‘c’ then ‘e’, then take a look at %CSTP. High values could indicate an issue, as this represents the overhead in co-scheduling CPUs from a co-stopped to a co-started state.

For example, if you have 16 cores, the total number of vCPUs defined across all active VMs should not exceed 80 (16 x 5).
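
The CPU rules of thumb above (the 1:5 core to vCPU ratio, and %RDY minus %MLMTD compared against 10%) can be written down as a small Python sketch. The function names and sample readings are my own; the thresholds are the ones quoted in this article.

```python
def max_vcpus(physical_cores, vcpus_per_core=5):
    """Upper limit on vCPUs across all active VMs for a 1:5 ratio."""
    return physical_cores * vcpus_per_core


def true_ready_percent(rdy_percent, mlmtd_percent):
    """%RDY with the share caused by a configured CPU limit removed."""
    return rdy_percent - mlmtd_percent


def cpu_contention(rdy_percent, mlmtd_percent, threshold=10.0):
    """True when the adjusted ready time reaches the 10% guideline."""
    return true_ready_percent(rdy_percent, mlmtd_percent) >= threshold


print(max_vcpus(16))
print(cpu_contention(14.0, 6.0))   # mostly caused by a CPU limit
print(cpu_contention(12.0, 0.5))   # genuine contention
```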

2) Investigate memory usage via esxtop and vCenter (press ‘m’ from esxtop, and ‘shift-v’ for per-VM stats only):

  1. Check host memory utilisation using esxtop
  2. For full-fat ESX only – the service console may be low on RAM, you can adjust this by following these instructions: http://www.vmware.com/pdf/esx_performance_tips_tricks.pdf
  3. Watch out for memory ballooning, which can have a significant impact on VM performance. You can track memory ballooning in vCenter and esxtop: MCTLTGT is the VMkernel’s desired memory balloon size, MCTLSZ is the actual size. If the target is greater than the size the balloon is inflated; if it is smaller the balloon is deflated. VM memory limits can also trigger ballooning.
  4. Transparent Page Sharing (TPS) allows VMs on a host to share identical memory pages – only used when memory resources are low/overcommitted.
  5. Check esxtop for SWCUR (currently used swap) and SWTGT (target swap). If SWTGT is greater than SWCUR, swapping will take place. Swapping is slow so should be avoided at all costs. If swapping is unavoidable use SSDs: there’s a 12% degradation with local SSD versus 69% for Fibre Channel and 83% for local SATA storage. (more information here)
  6. SWPWT represents the amount of time a Virtual Machine is waiting for memory to be swapped in and should always be below 5%
  7. SWR/SWW represent Swap Reads/Writes from disk to memory and vice versa.
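
The balloon and swap-wait rules above are simple comparisons, sketched here in Python. The function names and sample values are hypothetical; the logic follows the MCTLTGT/MCTLSZ description and the 5% SWPWT guideline in the list above.

```python
def balloon_action(mctltgt_mb, mctlsz_mb):
    """Compare the balloon target (MCTLTGT) with its current size (MCTLSZ)."""
    if mctltgt_mb > mctlsz_mb:
        return "inflate"
    if mctltgt_mb < mctlsz_mb:
        return "deflate"
    return "steady"


def swap_wait_ok(swpwt_percent, limit=5.0):
    """True when SWPWT is below the 5% guideline."""
    return swpwt_percent < limit


print(balloon_action(512, 256))
print(balloon_action(0, 256))
print(swap_wait_ok(1.5), swap_wait_ok(7.0))
```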

3) Using esxtop investigate storage (press ‘u’ for per-datastore or ‘d’ for per-HBA stats):

  1. Investigate DAVG – represents the round-trip time between HBA and storage, should be less than 30ms ideally
  2. Investigate KAVG – represents actual latency due to the VMkernel
  3. Investigate GAVG – represents the round-trip for IO requests sent from the host to storage; again lower is better, ideally less than 30ms.
  4. Check the CONS/s – this indicates SCSI reservation conflicts generated by metadata updates on the same LUN at a given time.
  5. vscsiStats (more info here) reports statistics per-VMDK/RDM
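
The 30ms guideline for DAVG and GAVG above can be applied to a set of esxtop readings with a short Python sketch. The function name and the sample readings are invented; KAVG is left out because no threshold is given for it here.

```python
def latency_flags(davg_ms, gavg_ms, limit_ms=30.0):
    """Flag DAVG and GAVG readings that reach the 30 ms guideline."""
    return {
        "DAVG": davg_ms >= limit_ms,
        "GAVG": gavg_ms >= limit_ms,
    }


# Hypothetical readings in milliseconds for two devices.
print(latency_flags(davg_ms=12.0, gavg_ms=14.5))
print(latency_flags(davg_ms=28.0, gavg_ms=41.0))
```

In the second sample only GAVG is flagged, which would point towards time spent in the host rather than at the storage itself.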

4) Finally, consider the network subsystem:

  1. Check bandwidth availability
  2. Using esxtop check %DRPTX and %DRPRX; if the latter is high consider increasing the Rx buffer from Device Manager on the VM (yes, Windows only... Linux configuration?)

If all else fails, check advisories on your hardware platform. I’ve run into issues in the past that have been device firmware specific, so don’t rule out the simplest of things.

UPDATE 22/02/2010 : Check out the new esxtop article here for further performance troubleshooting tips.