It is not possible to take advantage of NVMe SSD bandwidth with a single OSD; four is the optimum number of partitions per SSD drive for the best possible performance.
http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments#NVMe-SSD-partitioning
Ceph has a very interesting article regarding SSD/NVMe storage. It does not, however, seem to mention how these partitions were created in the first place, so here's what I've come up with.
Partitioning
Say you have a 2TB NVMe device you'd like to split into two OSDs.
# define a disk variable
disk=nvme2n1
# zap the disk
ceph-disk zap /dev/$disk
# create partitions for OSD metadata and storage
parted /dev/$disk -a optimal mkpart primary 1049kB 1GB
parted /dev/$disk -a optimal mkpart primary 1GB 1000GB
parted /dev/$disk -a optimal mkpart primary 1000GB 1001GB
parted /dev/$disk -a optimal mkpart primary 1001GB 2000GB
In the above example, I am creating ~1GB partitions for OSD metadata and ~1000GB partitions for OSD storage. I realize I could've dedicated less space to the metadata - I'm just trying to keep it simple.
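If you want to sanity-check the result before moving on, parted and lsblk will both show the new layout (the exact sizes printed will of course depend on your device):
# verify the resulting partition layout
parted /dev/$disk print
lsblk /dev/$disk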
Creating the OSDs
Now that we've created the partitions, let's build the basic structure required for the OSDs. The process is as follows.
1. Get UUIDs of the block partitions to use for OSD storage
# ls -l /dev/disk/by-partuuid/ | grep ${disk}p
lrwxrwxrwx 1 root root 15 Mar 6 16:46 1b92e750-6fbc-48bf-92ff-224c9cfaf6ed -> ../../nvme2n1p1
lrwxrwxrwx 1 root root 15 Mar 6 16:46 be1d844a-5b21-4fe1-90e1-e04baaf7105e -> ../../nvme2n1p2
lrwxrwxrwx 1 root root 15 Mar 6 16:46 0e620255-ac4d-4b24-b0bb-3e07ac5c9d5d -> ../../nvme2n1p3
lrwxrwxrwx 1 root root 15 Mar 6 16:46 357bd8be-673c-4b12-8ea0-fbe1e0152fa5 -> ../../nvme2n1p4
Save the UUIDs of the second and fourth partitions - we will need to explicitly tell Ceph to use these for OSD data storage.
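If you'd rather not copy the UUIDs by hand, blkid can fetch them for you - an optional shortcut that assumes the partition numbering used above:
# pull the PARTUUIDs of the data partitions directly
uuid1=$(blkid -o value -s PARTUUID /dev/${disk}p2)
uuid2=$(blkid -o value -s PARTUUID /dev/${disk}p4)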
2. Prepare the OSD metadata
uuid1=be1d844a-5b21-4fe1-90e1-e04baaf7105e
uuid2=357bd8be-673c-4b12-8ea0-fbe1e0152fa5
ceph-disk prepare --bluestore /dev/${disk}p1 --block-uuid $uuid1
ceph-disk prepare --bluestore /dev/${disk}p3 --block-uuid $uuid2
I'm assigning the UUIDs to variables for convenience - they will be needed again later.
Mount the OSD metadata partition into a temporary directory and ensure the block symbolic link is present:
# mkdir -p temp
# mount /dev/${disk}p1 temp/
# ls -l temp/
(...)
lrwxrwxrwx 1 ceph ceph 58 Jan 26 09:33 block -> /dev/disk/by-partuuid/be1d844a-5b21-4fe1-90e1-e04baaf7105e
(...)
# umount temp
If it isn't, create it manually with ln.
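Something along these lines should do the trick for the first OSD - this reuses the uuid1 variable and the temp/ directory from the steps above:
# recreate the block symlink by hand if ceph-disk didn't
mount /dev/${disk}p1 temp/
ln -s /dev/disk/by-partuuid/$uuid1 temp/block
umount temp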
3. Set permissions and activate the devices
chown ceph.ceph -R /dev/${disk}*
ceph-disk activate /dev/${disk}p1
ceph-disk activate /dev/${disk}p3
That should be it for creating the OSDs. You can now adjust the CRUSH position, device class, et cetera.
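For example, to make sure the new OSDs carry the nvme device class - the OSD IDs below are placeholders, substitute your own:
# re-tag the new OSDs with the nvme device class (example IDs)
ceph osd crush rm-device-class osd.12 osd.13
ceph osd crush set-device-class nvme osd.12 osd.13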
Ensuring the OSDs start on boot
In my experience, Ceph does not yet start these NVMe OSDs on boot by itself. You could create an entry in /etc/rc.local yourself, or use this tiny script I've created for this exact purpose - put it somewhere and add it to /etc/rc.local (it assumes you use kickstart to manage services on your nodes):
#!/bin/bash
# mounts and starts NVMe osds I guess
# kickstart only for now
# add execution of this script to /etc/rc.local
function log {
    echo "nvmeinit: $1"
    logger -t "nvmeinit" "$1"
}
# check if we have NVMe drives to begin with
if [ -z "$(ls /dev/nvme* 2>/dev/null)" ]; then
    log "wasn't able to detect NVMe drives, aborting"
    exit
fi
workdir="/tmp/.$$.nvmeinit"
chown ceph.ceph /dev/nvme*
mkdir -p $workdir
while read nvmepart; do
    # skip partitions that are already mounted
    ismounted=$(mount | grep "^$nvmepart")
    if [ ! -z "$ismounted" ]; then
        log "$nvmepart already mounted, skipping"
        continue
    fi
    # only the OSD metadata partitions carry an XFS filesystem
    xfs=$(file -s $nvmepart -b | grep XFS)
    if [ -z "$xfs" ]; then
        continue
    fi
    # peek at the partition to find out which OSD it belongs to
    mount $nvmepart $workdir
    osd=$(cat $workdir/whoami 2>/dev/null)
    umount $nvmepart
    if [ -z "$osd" ]; then
        log "warning: $nvmepart is XFS but no OSD ID detected"
        continue
    fi
    log "activating $nvmepart"
    mkdir -p "/var/lib/ceph/osd/ceph-$osd"
    chown ceph.ceph "/var/lib/ceph/osd/ceph-$osd"
    mount $nvmepart "/var/lib/ceph/osd/ceph-$osd"
    start ceph-osd id=$osd
done < <(ls /dev/nvme*n*p*)
# check if safe to remove the temporary work directory
c=$(mount | grep "$workdir")
if [ -z "$c" ]; then
    rm -rf "$workdir"
fi
The above code is available here as well: https://gitlab.com/pawadski/blog/blob/master/ceph-creating-multiple-osds-on-nvme-devices-luminous/nvmeinit.kickstart
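For reference, hooking it up is just a matter of copying the script somewhere and making both it and /etc/rc.local executable - the path below is only an example:
# install the script and add it to /etc/rc.local (example path)
cp nvmeinit.kickstart /usr/local/sbin/nvmeinit
chmod +x /usr/local/sbin/nvmeinit /etc/rc.local
echo '/usr/local/sbin/nvmeinit' >> /etc/rc.local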