Our XenServer (version 6.0) pool went bad. The word 'bad', while not precise, is the best way to describe it.
The planned configuration was:
- 3 servers, each with 3 NICs - running XenServer v6.0
- 2 iSCSI shared storage servers
- 2 of the NICs bonded, providing the management interface and added to each guest (VM) system.
- 1 NIC for a dedicated storage network.
- A dedicated storage network (addresses 192.168.40.x)
We had removed a server from the pool to upgrade the hardware (specifically to swap the hard drive for a 60Gb SSD drive).
The suggested procedure is to remove the server from the pool, upgrade the hardware, reinstall XenServer and then add it back to the pool.
We then reinstalled exactly the same version of XenServer. When it started, it could see the 3 network interfaces.
When we added it back to the pool, the new server couldn't see all of the networks.
On some attempts the primary interface wasn't working reliably after a reboot; on others the primary interface would connect ok, but the 3rd (storage) interface would not.
We tried various combinations: bonding the first 2 NICs before adding the server to the pool, and adding it without any configuration.
I tested this process on the test system mentioned below (with 2 servers and 1 storage array) and never had any problems removing a server and adding it back to the pool.
This was annoying, as nothing else had changed (particularly with the network interfaces) - just a new hard disk and a fresh install of XenServer.
After much research the decision was made to recreate the XenServer pool.
To summarise this involves:
- backing up the VM metadata.
- ensuring all VM data on the storage arrays is backed up (though it should be unaffected).
- for each server - removing it from the pool and installing XenServer.
- creating a new pool, and
- restoring the storage repositories and VM metadata.
This post details this procedure.
1. Preparation.
The pool was running, but in a degraded state, with 2 of the 3 servers operational. As this is an 'elective' procedure we were able to plan carefully, and in particular the procedure was practised on a test system.
The test system had 2 servers and 1 storage array, each server having 2 NICs; it supported 3 VMs (CentOS systems, each with 8Gb of storage).
It was set up much as per this post; there wasn't much difference between v6.0 and v6.1.
1.1 Testing.
The testing went largely as expected, with one problem.
When we attempted to restore the virtual machine metadata for all of the servers using the console (xsconsole) Restore Virtual Machine Metadata command, 1 of the 3 VMs would not restore.
In hindsight I believe this VM had a drive attached to local storage on one of the servers.
What I decided to do was to save the metadata for each VM individually, using the CLI vm-export command, in addition to the xsconsole command that backs up the metadata to disk on the storage repository.
The vm export command:
xe vm-export filename=vm_testserv2 uuid=ac7fc085-aa66-826d-5a69-599c5af20118 metadata=true
This proved to be invaluable later.
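With more than a handful of VMs, the individual exports are worth scripting. A minimal sketch, assuming it is run on the pool master with the xe CLI available; the vm_<name-label> filename scheme follows the convention used in this post, and the helper name is my own:

```shell
#!/bin/sh
# Export metadata for every (non control-domain) VM, one file per VM.
# Assumes: run on the pool master; filenames derived from each VM's name-label.
export_all_vm_metadata() {
    for uuid in $(xe vm-list is-control-domain=false --minimal | tr ',' ' '); do
        # Derive a filesystem-safe name from the VM's name-label
        name=$(xe vm-list uuid="$uuid" params=name-label --minimal | tr ' /' '__')
        xe vm-export filename="vm_${name}" uuid="$uuid" metadata=true
    done
}
```

The --minimal flag makes xe emit comma-separated values, which is what makes the loop straightforward.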
As part of the testing we upgraded the test pool from v6.0 to v6.1 (skipping v6.0.2) - the same version of XenServer as the live pool. I wanted to be sure that there wouldn't be any problem accessing the storage repositories and the VM metadata backups and restarting the virtual machines themselves.
1.2 Documentation.
We checked and documented the following:
- VM storage configuration.
- VM (in this case CentOS) kernel levels; CentOS versions older than 5.4 can have problems with XenServer versions 6.0.2 and 6.1.
- Ejected any mounted virtual machine ISO images.
- Storage repository configuration - ip addresses, names and connection details.
- Storage repository configuration - as screenshots from XenCenter; if under stress later, a pictorial representation can be helpful. We had 7 virtual machines, so taking the screenshots was manageable.
- Saved lists from the xe command line utilities, including: xe sr-list; xe vm-list
- Backed up (dumped) the pool database (xe pool-dump-database) - a metadata backup.
- Ran a server status report and uploaded it to https://taas.citrix.com/ - Citrix Autosupport.
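The CLI parts of that checklist can be captured in one pass. A sketch, assuming it is run on the pool master; the directory and file names are my own choices, not anything XenServer requires:

```shell
#!/bin/sh
# Capture the pool documentation in one pass (run on the pool master).
capture_pool_docs() {
    doc_dir="pool-docs-$(date +%Y%m%d)"
    mkdir -p "$doc_dir"
    xe sr-list params=all      > "$doc_dir/sr-list.txt"
    xe vm-list params=all      > "$doc_dir/vm-list.txt"
    xe pif-list params=all     > "$doc_dir/pif-list.txt"     # physical NICs
    xe network-list params=all > "$doc_dir/network-list.txt"
    # Dump of the whole pool database - the metadata backup
    xe pool-dump-database file-name="$doc_dir/pool-backup.db"
}
```

Keeping the text dumps alongside the screenshots means the names, uuids and addresses can be grepped for later rather than read off images.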
[Screenshot: Example server disk configuration]
One thing we didn't record specifically was the IP addresses of the VMs; during the recovery process we did need this information. We retrieved it from other records.
2. The Process.
The following steps were done:
- Check no disks are on any of the local storage arrays. Templates created when the servers were first built can sometimes remain on local storage.
- Recheck and eject any mounted VM CD drives (as mentioned above).
- Do another backup of the virtual machines to a storage repository, and also back up each VM individually:
xe vm-export filename=vm_custweb uuid=ac7fc085-aa66-826d-5a69-599c5af20118 metadata=true
- Remove one server from the pool (in our case, the newly installed server that we couldn't rejoin to the pool).
- Install XenServer 6.1, giving it the same networking details (IP address, gateway, DNS settings).
- On the old servers in the running pool, stop all virtual machines and detach (disconnect) from all of the shared storage repositories. For now we left the servers themselves running.
- On the new server configure the networking (note however I didn't create the bonded network interface at this point).
- On the new server, connect to the storage repositories.
- Restore the virtual machines - (xs console, see screenshot below)
- Start one of the VMs.
- Then shut down the old XenServers, install XenServer 6.1 on them, and add them to the new pool.
[Screenshot: VM metadata restore in xsconsole]
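The 'detach from the shared storage repositories' step can also be done from the xe CLI by unplugging each of the SR's PBDs (one per host). A sketch; detach_sr is a hypothetical helper name, and the SR uuid would come from xe sr-list:

```shell
#!/bin/sh
# Detach a shared SR from every host by unplugging its PBDs.
# The SR uuid is a placeholder - find the real one with `xe sr-list`.
detach_sr() {
    sr_uuid="$1"
    for pbd in $(xe pbd-list sr-uuid="$sr_uuid" --minimal | tr ',' ' '); do
        xe pbd-unplug uuid="$pbd"
    done
}
```

Unplugging the PBDs detaches the storage without forgetting the SR, so it can be reattached to the new pool with its contents intact.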
3. Problems.
VM Metadata Restoration.
Of our 7 virtual machines, one would not restore. It was known not to have any local storage, and apart from being one of the earliest (oldest) systems it didn't differ from the other VMs.
In this case it could easily be restored from the VM metadata export:
xe vm-import filename=vm_custweb metadata=true
This restore worked without any problem.
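With one metadata export file per VM (the vm_* files from the earlier vm-export commands), the imports can be looped rather than typed one by one. A sketch; import_all_vm_metadata is my own helper name:

```shell
#!/bin/sh
# Re-import every per-VM metadata export in the current directory.
# Assumes the export files use the vm_ filename prefix shown above.
import_all_vm_metadata() {
    for f in vm_*; do
        [ -e "$f" ] || continue   # skip when the glob matches nothing
        xe vm-import filename="$f" metadata=true
    done
}
```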
VM Networking.
One mistake we did make was not fully reconfiguring the networking before doing the VM metadata restore.
When we restored the VMs, they were not attached to a network because I hadn't recreated the bonded interface they were expecting.
I tried the configuration as it was with a test server: when it started, the CentOS system didn't have any network interfaces. Later, even after I had recreated the bonded interface, the other restored systems also didn't have any network interfaces. If I had configured the networking before doing the metadata restore, I believe the restore would have been complete.
Our test system only had 2 NICs, so the primary interface for the VMs was not bonded. If it had been, we would have picked up this problem in testing.
What I had to do, after configuring the networking, was check that each virtual machine had a network interface and then, once the guest OS had started, reconfigure its networking - the IP address/netmask and default gateway. The DNS configuration, not being tied to a particular interface, was still in place.
This was manageable for 7 VMs, of which only 1 had more than one interface. For a larger number I would have dropped them and re-restored them; reconfiguring a large number of VMs manually would have been error prone.
[Screenshot: Network configuration]
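Checking that every restored VM actually has a virtual network interface can be scripted rather than done by eye. A sketch; vms_without_vifs is a hypothetical helper that prints the name of any VM with no VIFs attached:

```shell
#!/bin/sh
# List VMs that came back from the metadata restore with no virtual
# network interfaces (VIFs) attached.
vms_without_vifs() {
    for uuid in $(xe vm-list is-control-domain=false --minimal | tr ',' ' '); do
        vifs=$(xe vif-list vm-uuid="$uuid" --minimal)
        if [ -z "$vifs" ]; then
            xe vm-list uuid="$uuid" params=name-label --minimal
        fi
    done
}
```

Any VM it names would need a VIF recreated (and, as above, its guest OS networking reconfigured) before being started.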
Unreadable Guest Server Attached Disk.
This was another error that wasn't anticipated and doesn't have a known cause.
One of the VMs when starting was unable to mount a partition. An fsck run from the single user command prompt also failed.
We edited the /etc/fstab file and the system started without the drive.
[ To edit the fstab in this situation, you need to remount the root partition read-write from single user mode:
# mount -o remount,rw /dev/sda1 / ]
We did have a copy of the data, but I investigated first.
The virtual machine metadata restore did include the drive; however, a listing from fdisk showed:
[root@webtest ~]# fdisk -l
....
Disk /dev/xvdc: 42.9 GB, 42949672960 bytes
255 heads, 63 sectors/track, 5221 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/xvdc doesn't contain a valid partition table
I first tried the rescue command in parted:

[root@webtest ~]# parted /dev/xvdc
Using /dev/xvdc
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Model: Xen Virtual Block Device (xvd)
Disk /dev/xvdc: 42.9GB
Sector size (logical/physical): 512B/512B
Partition Table: loop

Number  Start   End     Size    File system  Flags
 1      0.00kB  42.9GB  42.9GB  ext3
This returned the expected partition - a single ext2/ext3 partition for the entire disk. However an fsck still failed.
Once again there was no obvious reason why this partition should have problems; it hadn't been resized or otherwise changed. One possibility is that it may have been moved from another storage repository, as opposed to being created and formatted on its current repository.
I then attempted to recover the partition using gpart, which did work.
The full trace of the recovery session is below. I had first run gpart in read-only mode before specifying write mode (-W <device>).
[root@webtest ~]# gpart -W /dev/xvdc /dev/xvdc

Begin scan...
Possible partition(Linux ext2), size(40954mb), offset(0mb)
End scan.

Checking partitions...
Partition(Linux ext2 filesystem): primary
Ok.

Guessed primary partition table:
Primary partition(1)
   type: 131(0x83)(Linux ext2 filesystem)
   size: 40954mb #s(83875296) s(63-83875358)
   chs:  (0/1/1)-(1023/254/63)d (0/1/1)-(5220/254/57)r

Primary partition(2)
   type: 000(0x00)(unused)
   size: 0mb #s(0) s(0-0)
   chs:  (0/0/0)-(0/0/0)d (0/0/0)-(0/0/0)r

Primary partition(3)
   type: 000(0x00)(unused)
   size: 0mb #s(0) s(0-0)
   chs:  (0/0/0)-(0/0/0)d (0/0/0)-(0/0/0)r

Primary partition(4)
   type: 000(0x00)(unused)
   size: 0mb #s(0) s(0-0)
   chs:  (0/0/0)-(0/0/0)d (0/0/0)-(0/0/0)r

Edit this table (y,n) : y
Edit which partition (1..4, q to quit) : 1
Warning: entered values will not be checked; enter at your own risk!
1 - Absolute start sector (63)
2 - Absolute sector count (83875296)
3 - Partition type        (131)(Linux ext2 filesystem)
Edit which value (1..3, q to quit) : 1
Enter value for 1 : 63
Warning: entered values will not be checked; enter at your own risk!
1 - Absolute start sector (63)
2 - Absolute sector count (83875296)
3 - Partition type        (131)(Linux ext2 filesystem)
Edit which value (1..3, q to quit) : 3
Enter value for 3 : 131
Warning: entered values will not be checked; enter at your own risk!
1 - Absolute start sector (63)
2 - Absolute sector count (83875296)
3 - Partition type        (131)(Linux ext2 filesystem)
Edit which value (1..3, q to quit) : q

Primary partition(1)
   type: 131(0x83)(Linux ext2 filesystem)
   size: 40954mb #s(83875296) s(63-83875358)
   chs:  (0/1/1)-(1023/254/63)d (0/1/1)-(5220/254/57)r

Primary partition(2)
   type: 000(0x00)(unused)
   size: 0mb #s(0) s(0-0)
   chs:  (0/0/0)-(0/0/0)d (0/0/0)-(0/0/0)r

Primary partition(3)
   type: 000(0x00)(unused)
   size: 0mb #s(0) s(0-0)
   chs:  (0/0/0)-(0/0/0)d (0/0/0)-(0/0/0)r

Primary partition(4)
   type: 000(0x00)(unused)
   size: 0mb #s(0) s(0-0)
   chs:  (0/0/0)-(0/0/0)d (0/0/0)-(0/0/0)r

Edit which partition (1..4, q to quit) : q
Activate which partition (1..4, q to quit) : 1
Write this partition table (y,n) : y
After this I was able to check and mount the partition:
[root@webtest ~]# fsck.ext3 /dev/xvdc1
e2fsck 1.39 (29-May-2006)
/dev/xvdc1: clean, 1394/5242880 files, 8291901/10484412 blocks
While this process worked ok, there is no substitute for having a complete backup of the system.
Conclusion.
To conclude, the process went well and the pool is now in an improved state, with all 3 servers in it.
The exercise was a valuable one in terms of practising a disaster recovery scenario and improving our knowledge and confidence with Xenserver.
I would recommend the procedure - practised first on a test system - to anyone responsible for managing a virtual server pool.
Other References.
The documents: