How to - Ceph - Configure Ceph on a new drive

source: https://ceph.com/geen-categorie/admin-guide-replacing-a-failed-disk-in-a-ceph-cluster/

Remove the OSD of the faulty drive

If you are replacing a faulty drive with a new one, you must remove the OSD associated with the faulty drive before creating the new OSD.

Requirement: the faulty SSD must have been replaced with a healthy SSD.

  1. Login to the Ceph node with the faulty drive.

  2. Identify the device name of the faulty drive by running the following commands on the Ceph node:

    Commands to run on the Ceph node:

sudo ceph osd tree

lsblk

Expected outcome:
Identify the device label used by the new SSD. If it does not match the label the faulty drive had, you will have to take action later in this procedure.
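
Optionally, the lsblk output can be narrowed to the relevant columns to make the comparison easier; this invocation is a convenience suggestion, not part of the original procedure:

lsblk -o NAME,SIZE,TYPE,MOUNTPOINT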

  3. Login to one Ceph monitor

  4. Verify the Ceph cluster health

    Commands to run on the Ceph monitor:

sudo ceph status

Expected outcome:
The cluster must be HEALTH_OK.
If the cluster is not healthy, you must fix the underlying issues before proceeding with the next steps.
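
For a quick scripted check, ceph health prints only the health status line; this shortcut is an optional addition to the procedure:

sudo ceph health
# should print HEALTH_OK before you continue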

  5. Mark the OSD out and stop the OSD service.

    Commands to run on the Ceph monitor (replace XX with the OSD number):
sudo ceph osd out osd.XX

sudo service ceph stop osd.XX
  6. Remove the OSD from the CRUSH map

    Commands to run on the Ceph monitor:
sudo ceph osd crush remove osd.XX
  7. Delete the keyring for that OSD and finally remove the OSD.

    Commands to run on the Ceph monitor:
sudo ceph auth del osd.XX

sudo ceph osd rm osd.XX

Expected outcome:
The output should look like this...

storage-Y.ZZZZ:~$ sudo ceph auth del osd.XX

updated

storage-Y.ZZZZ:~$ sudo ceph osd rm osd.XX

removed osd.XX
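
As an end-to-end illustration, the complete removal sequence from steps 5 to 7, with a hypothetical OSD number 12, would be:

# hypothetical example with OSD number 12
sudo ceph osd out osd.12
sudo service ceph stop osd.12
sudo ceph osd crush remove osd.12
sudo ceph auth del osd.12
sudo ceph osd rm osd.12
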
  8. Logout of the Ceph monitor

  9. Login to the Ceph node

  10. Unmount the faulty drive from the OS.

List the block devices on the host.

Commands to run on the Ceph node:

lsblk

Unmount the faulty device using the mountpoint found with the previous command.

Commands to run on the Ceph node:

sudo umount <mountpoint>
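
For example, assuming the OSD data partition is mounted at the conventional location /var/lib/ceph/osd/ceph-XX (confirm the actual mountpoint with lsblk first):

sudo umount /var/lib/ceph/osd/ceph-XX
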
  11. Logout of the Ceph node

  12. Login to a Ceph monitor

  13. In case the storage node is affected by the maintenance done by the datacenter technician, you do not want CRUSH to automatically rebalance the cluster if the storage node gets taken out.
    On the Ceph monitor, set the noout and norebalance flags to disable automatic rebalancing of the cluster.

    Commands to run on the Ceph monitor:

sudo ceph osd set noout
sudo ceph osd set norebalance

Expected outcome:
When running the command ceph status, the flags must be present.

sudo ceph status            
    cluster XYZ
     health HEALTH_WARN
            noout,norebalance flag(s) set
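
The flags can also be confirmed from the OSD map; this check is an optional alternative to reading ceph status:

sudo ceph osd dump | grep flags
# the flags line should include noout,norebalance
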
  14. Wait for the cluster health to change to HEALTH_WARN.

    To monitor the Ceph cluster health, run on the Ceph monitor:
sudo ceph status

Expected outcome:
The cluster health should be HEALTH_WARN while the noout and norebalance flags are set.
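
Instead of re-running ceph status by hand, a small polling loop (an optional sketch, not part of the original procedure) can wait for the state change:

# poll every 10 seconds until the health line reports HEALTH_WARN
until sudo ceph health | grep -q HEALTH_WARN; do sleep 10; done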

  15. Logout of the Ceph monitor

  16. Wait for the datacenter technician to replace the faulty drive.

  17. Login to the Ceph monitor

  18. After the drive has been replaced, list the disks on the storage node and find the device name of the new drive that was inserted.

    Commands to run on the Ceph monitor:

sudo -i
su - cephadmin
cd ceph-cluster
ceph-deploy disk list <storage_name>

Expected outcome:
Look for the device showing "other, unknown".
For example: [storage-XX][DEBUG ] /dev/sdf other, unknown.
The device label must match the one used by the previous faulty drive.
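
To filter the listing down to unpartitioned candidates, the ceph-deploy output can be piped through grep; this one-liner is a convenience suggestion (2>&1 merges the ceph-deploy log streams):

ceph-deploy disk list <storage_name> 2>&1 | grep 'other, unknown'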

  19. Zap the new drive and deploy it in the Ceph cluster.

    Commands to run on the Ceph monitor (using the user cephadmin, see step 18):
ceph-deploy disk zap <storage_name>:<device_name>

# example: ceph-deploy disk zap storage-Y:sdx

Expected outcome:

The output of the disk zap should look like this

[storage-Y][DEBUG ] GPT data structures destroyed! You may now partition the disk using fdisk or
[storage-Y][DEBUG ] other utilities.
[storage-Y][DEBUG ] Creating new GPT entries.
[storage-Y][DEBUG ] The operation has completed successfully.
[ceph_deploy.osd][DEBUG ] Calling partprobe on zapped device /dev/sdx
[storage-Y][DEBUG ] find the location of an executable
[storage-Y][INFO  ] Running command: sudo /sbin/partprobe /dev/sdx
  20. Prepare the new drive as an OSD in the Ceph cluster.

    Commands to run on the Ceph monitor (using the user cephadmin, see step 18):
ceph-deploy --overwrite-conf osd prepare <storage_name>:<device_name>
 
# example: ceph-deploy --overwrite-conf osd prepare storage-Y:sdx

Expected outcome:

The osd prepare should finish and print the following line

[ceph_deploy.osd][DEBUG ] Host storage-Y is now ready for osd use.

The OSD should be added to the cluster. To verify that the OSD was added, run the command:

sudo ceph osd tree
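
To spot the new OSD quickly in a large tree, grep for its number (XX stands for the new OSD number, as elsewhere in this document); the new OSD should appear in the tree and eventually come up:

sudo ceph osd tree | grep osd.XX
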
  21. Re-enable automatic rebalancing of the cluster after the maintenance.
    Commands to run on the Ceph monitor using the user nakina or root (if running as root, remove sudo from the commands):
sudo ceph osd unset noout
sudo ceph osd unset norebalance

Expected outcome:
The output should look like this...

storage-Y.ZZZZ:~$ sudo ceph osd unset noout
unset noout
storage-Y.ZZZZ:~$ sudo ceph osd unset norebalance
unset norebalance
  22. Monitor the rebalancing of the Ceph cluster
sudo ceph status
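
To follow the recovery continuously instead of re-running the command, watch is one option (assuming sudo does not prompt for a password on the monitor):

watch -n 10 'sudo ceph status'

Once rebalancing completes, the cluster should return to HEALTH_OK.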