A single OSD cannot take full advantage of the bandwidth of an NVMe SSD. Four partitions per drive is the optimum number that gives the best possible performance.

Ceph has a very interesting article regarding SSD/NVMe storage. It does not seem to mention, however, how these partitions were created in the first place, so here's what I've come up with.


Say you have a 2TB NVMe device you'd like to split into two OSDs.

# define a disk variable (nvme2n1 in this example)
disk=nvme2n1
# zap the disk 
ceph-disk zap /dev/$disk 
# create partitions for OSD metadata and storage 
parted /dev/$disk -a optimal mkpart primary 1049kB 1GB 
parted /dev/$disk -a optimal mkpart primary 1GB 1000GB 
parted /dev/$disk -a optimal mkpart primary 1000GB 1001GB 
parted /dev/$disk -a optimal mkpart primary 1001GB 2000GB

In the above example, I am creating ~1GB partitions for OSD metadata and ~1000GB partitions for OSD storage. I realize I could've dedicated less space to the metadata - I'm just trying to keep it simple.
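The same layout can be generated for any number of OSDs. Below is a minimal sketch, assuming a hypothetical helper named plan_partitions (my own name, not part of Ceph or parted) that only prints the parted commands instead of running them. The disk size in GB, the OSD count and the optional metadata size are parameters I introduced for illustration.

```bash
# Hypothetical helper: print the parted commands that split a disk into
# metadata/storage partition pairs, one pair per OSD. Nothing is executed,
# so the output can be reviewed before it is run.
plan_partitions() {
  local disk="$1"          # device name without /dev/, e.g. nvme2n1
  local size_gb="$2"       # usable size of the disk in GB
  local osds="$3"          # number of OSDs to create on the disk
  local meta_gb="${4:-1}"  # size of each metadata partition in GB
  local slice=$(( size_gb / osds ))
  local i=0
  local start meta_end end first

  while [ "$i" -lt "$osds" ]; do
    start=$(( i * slice ))
    meta_end=$(( start + meta_gb ))
    end=$(( start + slice ))
    # the first partition starts at 1049kB rather than 0, as in the example above
    if [ "$i" -eq 0 ]; then first="1049kB"; else first="${start}GB"; fi
    echo "parted /dev/$disk -a optimal mkpart primary $first ${meta_end}GB"
    echo "parted /dev/$disk -a optimal mkpart primary ${meta_end}GB ${end}GB"
    i=$(( i + 1 ))
  done
}
```

Calling plan_partitions nvme2n1 2000 2 prints the same four parted commands as the listing above. Passing 4 as the last argument would print eight commands for four OSDs.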

Creating the OSDs

Now that we've created the partitions, let's build the basic structure required for the OSDs. The process is as follows.

1. Get UUIDs of the block partitions to use for OSD storage

# ls -l /dev/disk/by-partuuid/ | grep ${disk}p 
lrwxrwxrwx 1 root root 15 Mar 6 16:46 1b92e750-6fbc-48bf-92ff-224c9cfaf6ed -> ../../nvme2n1p1 
lrwxrwxrwx 1 root root 15 Mar 6 16:46 be1d844a-5b21-4fe1-90e1-e04baaf7105e -> ../../nvme2n1p2 
lrwxrwxrwx 1 root root 15 Mar 6 16:46 0e620255-ac4d-4b24-b0bb-3e07ac5c9d5d -> ../../nvme2n1p3 
lrwxrwxrwx 1 root root 15 Mar 6 16:46 357bd8be-673c-4b12-8ea0-fbe1e0152fa5 -> ../../nvme2n1p4

Save the UUIDs of the second and fourth partitions - we will need to explicitly tell Ceph to use these for OSD data storage.
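Copying the UUIDs by hand is error-prone, so here is a rough sketch of how they could be picked up automatically. partuuid_of is a hypothetical helper of mine. It reads the same symlinks that the ls -l listing shows, and the optional second argument (an alternative directory) exists only so the function can be tried out without real devices.

```bash
# Hypothetical helper: print the PARTUUID of a partition by scanning the
# by-partuuid symlinks for one whose target is the given partition name.
partuuid_of() {
  local part="$1"                          # e.g. nvme2n1p2
  local dir="${2:-/dev/disk/by-partuuid}"  # assumption: standard udev path
  local link

  for link in "$dir"/*; do
    [ -L "$link" ] || continue
    # readlink prints the link target, e.g. ../../nvme2n1p2
    if [ "$(basename "$(readlink "$link")")" = "$part" ]; then
      basename "$link"
      return 0
    fi
  done
  return 1
}

# Intended use with the partitions from the example:
# uuid1=$(partuuid_of "${disk}p2")
# uuid2=$(partuuid_of "${disk}p4")
```

The function returns a non-zero status when no matching link is found, so an empty $uuid1 or $uuid2 is a sign that the partition table has not been re-read yet.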

2. Prepare the OSD metadata

ceph-disk prepare --bluestore /dev/${disk}p1 --block-uuid $uuid1 
ceph-disk prepare --bluestore /dev/${disk}p3 --block-uuid $uuid2

I'm assigning the UUIDs to variables ($uuid1 and $uuid2) for convenience - they will be needed again later.

Mount the OSD metadata partition on a temporary directory and check that the block symbolic link is present:

# mkdir -p temp 
# mount /dev/${disk}p1 temp/ 
# ls -l temp/ 
lrwxrwxrwx 1 ceph ceph 58 Jan 26 09:33 block -> /dev/disk/by-partuuid/be1d844a-5b21-4fe1-90e1-e04baaf7105e 
# umount temp

If it isn't, create it manually with ln -s while the partition is still mounted.
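A minimal sketch of that manual fix follows. ensure_block_link is a hypothetical helper name of mine. It takes the temporary mount point and the UUID of the storage partition, and leaves an already correct link untouched.

```bash
# Hypothetical helper: make sure <mountpoint>/block points at the given
# partition UUID, creating or repairing the symlink if necessary.
ensure_block_link() {
  local mnt="$1"    # where the metadata partition is mounted, e.g. temp
  local uuid="$2"   # PARTUUID of the matching storage partition
  local target="/dev/disk/by-partuuid/$uuid"

  if [ "$(readlink "$mnt/block" 2>/dev/null)" = "$target" ]; then
    echo "block link OK"
    return 0
  fi
  # -f replaces a wrong link, -n keeps ln from following an existing one
  ln -sfn "$target" "$mnt/block"
  echo "block link created"
}

# Intended use, between the mount and umount steps:
# ensure_block_link temp "$uuid1"
```

In the listing above the link is owned by ceph:ceph, so a link created as root probably needs a chown -h ceph:ceph temp/block afterwards.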

3. Set permissions on device and activate the device

chown -R ceph:ceph /dev/${disk}*
ceph-disk activate /dev/${disk}p1 
ceph-disk activate /dev/${disk}p3

That should be it for creating the OSDs. You can now adjust the CRUSH position, device class, et cetera.
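As an illustration of the device class part, here is a small sketch that prints the commands for tagging OSDs with a class. It assumes a Luminous or later cluster, where ceph osd crush rm-device-class and ceph osd crush set-device-class are available. The helper name device_class_cmds and the OSD IDs in the usage line are hypothetical, and nothing is executed against the cluster.

```bash
# Hypothetical helper: print the commands that move the given OSD IDs into
# a device class. An existing class has to be removed before a new one can
# be set, hence the rm-device-class line for each OSD.
device_class_cmds() {
  local class="$1"   # e.g. nvme
  shift
  local id
  for id in "$@"; do
    echo "ceph osd crush rm-device-class osd.$id"
    echo "ceph osd crush set-device-class $class osd.$id"
  done
}

# Example with made-up OSD IDs:
# device_class_cmds nvme 12 13
```

Review the printed commands and run them by hand, or pipe them to sh once you are sure the IDs are right.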

Ensuring the OSDs start on boot

In my experience, Ceph does not yet start NVMe OSDs on boot. You could create an entry in /etc/rc.local, or use this tiny script I've written for this exact purpose: put it somewhere and call it from /etc/rc.local. It assumes you use Upstart to manage services on your nodes:

#!/bin/bash
# mounts and starts NVMe OSDs
# Upstart only for now (uses "start ceph-osd id=N")
# add execution of this script to /etc/rc.local

function log {
  echo "nvmeinit: $1"
  logger -t "nvmeinit" "$1"
}

# temporary mount point used to read the OSD ID of each partition
workdir=/tmp/.$$.nvmeinit

# check if we have NVMe drives to begin with
if [ -z "$(ls /dev/nvme* 2>/dev/null)" ]; then
  log "wasn't able to detect NVMe drives, aborting"
  exit 1
fi

chown ceph:ceph /dev/nvme*

mkdir -p "$workdir"

while read -r nvmepart; do
  # trailing space so that p1 does not also match p10
  ismounted=$(mount | grep "^$nvmepart ")

  if [ -n "$ismounted" ]; then
    log "$nvmepart already mounted, skipping"
    continue
  fi

  xfs=$(file -sb "$nvmepart" | grep XFS)

  # only the XFS metadata partitions carry an OSD ID
  if [ -z "$xfs" ]; then
    continue
  fi

  mount "$nvmepart" "$workdir"

  osd=$(cat "$workdir/whoami" 2>/dev/null)

  umount "$nvmepart"

  if [ -z "$osd" ]; then
    log "warning: $nvmepart is XFS but no OSD ID detected"
    continue
  fi

  log "activating $nvmepart"

  mkdir -p "/var/lib/ceph/osd/ceph-$osd"
  chown ceph:ceph "/var/lib/ceph/osd/ceph-$osd"

  mount "$nvmepart" "/var/lib/ceph/osd/ceph-$osd"

  start ceph-osd id="$osd"
done < <(ls /dev/nvme*n*p*)

# check if safe to remove
c=$(mount | grep "$workdir")

if [ -z "$c" ]; then
  rm -rf "$workdir"
fi

The above code is available here as well: