CephFS snapshot, restore and cloning with Ceph CSI driver

In our previous article we discussed the terminology of Kubernetes volume snapshots and the concepts around it. If you haven’t read it yet, I would recommend taking a look at that article before getting into this one. But I will leave that up to you.

This article describes how CephFS snapshots can be created on CephFS volumes provisioned by the Ceph CSI driver. Once you have snapshots available, you can restore them to NEW Persistent Volume Claims! We also have the ability to CLONE an existing PVC to a new volume! This covers a good amount of backup-and-restore use cases in many scenarios. The only requirement here is to have Ceph CSI version 3.1.0 running in your setup, either by deploying Ceph CSI directly via the templates from its project repository or by installing Rook version >= 1.4.1.

I have also captured a demo of this process in the screencast below.

Since I have documented the process in the screencast, it is better not to duplicate it here. However, if you have any queries, suggestions or comments, please leave a comment here or reach out!

Kubernetes Volume Snapshot, Restore and PVC cloning introduction

As you know, a snapshot is a point-in-time copy of a volume. In Kubernetes, a VolumeSnapshot represents a snapshot of a volume on a storage system.
Similar to how API resources like PersistentVolume and PersistentVolumeClaim are used to provision volumes for users and administrators in a Kubernetes/OCP cluster, VolumeSnapshot and VolumeSnapshotContent are the API resources provided to create volume snapshots.

To support and work with snapshots, we mainly have three API objects or custom resources:
VolumeSnapshot, VolumeSnapshotContent and VolumeSnapshotClass. These objects and their workflow are analogous to the Persistent Volume Claim (PVC) workflow, if you are familiar with it.

A VolumeSnapshot is a request for a snapshot of a volume by a user. It is a user-facing object, similar to a PersistentVolumeClaim with which a user requests storage.

A VolumeSnapshotClass is similar to a StorageClass, but for snapshot requests. It is an admin-facing object in an OCP cluster. It allows you to specify different attributes belonging to a VolumeSnapshot. These attributes may differ among snapshots taken from the same volume on the storage system and therefore cannot be expressed by using the StorageClass of the PersistentVolumeClaim.

Finally, there is the VolumeSnapshotContent object. A VolumeSnapshotContent is a snapshot taken from a volume in the cluster. It is a cluster resource, just like a PersistentVolume. In the dynamic provisioning world, an admin or user does not have to worry about this object; it is created and managed by internal controllers.

In Slide 1, you can map the VolumeSnapshot workflow and objects to the Persistent Volume Claim workflow and objects.

I think it is pretty clear that this is analogous to the Persistent Volume Claim workflow. I have listed the YAML representations of VolumeSnapshotClass, VolumeSnapshot and VolumeSnapshotContent in the slides so that you get an idea of what they look like. Just to touch upon it: in the VolumeSnapshotClass object, the thing to note is the highlighted part, where the admin has to define the driver, deletion policy, etc.
We just went through the VolumeSnapshotClass, so next is the VolumeSnapshot. It is the user input, i.e. how a user requests a snapshot from the cluster. If you look at the highlighted part, this is where the user says, "I want to create a snapshot from a PVC or volume called pvc-test." The `spec.source` field is filled with the parent or source PVC. On the right side you can also see the VolumeSnapshotContent object. As mentioned earlier, in the dynamic provisioning case an admin or user does not have to create this object; that is the responsibility of the controller. On a bound object you can see that the VolumeSnapshotContent holds a reference to the user request via the VolumeSnapshot reference, and to the parent volume via the volume handle.
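Since the slide contents are not reproduced here, below is a minimal sketch of a VolumeSnapshotClass and a VolumeSnapshot for the CephFS driver. The class name, clusterID and secret values are placeholders that depend on your Ceph CSI deployment; `pvc-test` and `new-snapshot-test` match the names used in the slides:

[terminal]
---
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshotClass
metadata:
  name: csi-cephfsplugin-snapclass
driver: cephfs.csi.ceph.com
deletionPolicy: Delete
parameters:
  clusterID: <cluster-id>
  csi.storage.k8s.io/snapshotter-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/snapshotter-secret-namespace: default
---
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: new-snapshot-test
spec:
  volumeSnapshotClassName: csi-cephfsplugin-snapclass
  source:
    persistentVolumeClaimName: pvc-test
[/terminal]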

That is mainly about the YAMLs. While working with snapshots, we should know a few rules:

The parent/source PVC should be in Bound state.
The PVC should not be in use while the snapshot is in progress.

There is also PVC protection available to ensure that in-use PersistentVolumeClaim API objects are not removed from the system while a snapshot is being taken from them (as this may result in data loss). The controller adds a finalizer to make sure the PVC does not get removed.
That is about snapshots; now let us think about restore.


Restore:

When talking about restore, we are talking about restoring the snapshot to a completely NEW volume; we are not talking about in-place restore, rollback or revert here. The lack of support for in-place restore or rollback is listed as a limitation in upstream Kubernetes at the moment, so we can expect this support in future Kubernetes releases.

If you look at the YAML of a restore request, the highlighted part is where you specify that you want to restore to a new volume from the VolumeSnapshot object named "new-snapshot-test". That is the only difference compared to a general Persistent Volume Claim request.

In short, a restore request looks like a PVC request; in other words, it is a PVC request with the data source set to a VolumeSnapshot.
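To make that concrete, a restore request could look like the sketch below; the PVC name, storage class and size are placeholders, and `new-snapshot-test` is the snapshot object mentioned above:

[terminal]
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-pvc
spec:
  storageClassName: csi-cephfs-sc
  dataSource:
    name: new-snapshot-test
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
[/terminal]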

Clone:

Cloning is about creating a new volume from an existing volume. A clone is defined as a duplicate of an existing Persistent Volume Claim (PVC). So, instead of creating a new empty volume, we are requesting a volume prepopulated with the content of the source PVC. It is a simple interface.
If you look at the YAML representation, it looks like a PVC request with one small difference: the dataSource field is filled with the name of an existing PVC.
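As a sketch, a clone request could look like this; the PVC names, storage class and size are placeholders. Note that the dataSource kind here is PersistentVolumeClaim, and no apiGroup is needed because PVCs live in the core API group:

[terminal]
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cloned-pvc
spec:
  storageClassName: csi-cephfs-sc
  dataSource:
    name: pvc-test
    kind: PersistentVolumeClaim
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
[/terminal]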

If you go back and compare the restore and clone YAMLs, the only difference is that the data source is a VolumeSnapshot in the restore case and a PVC/PersistentVolumeClaim in the clone case.
As with snapshots, while working with clones we should know that:

The source PVC must be Bound and not in use.
Only same-namespace cloning is allowed; that means the source and destination PVC should be in the same namespace.
Only cloning within the same StorageClass is allowed.

Cloning can only be performed between two volumes that use the same VolumeMode setting. The volume mode is nothing but the way Kubernetes lets you specify a PVC as either Filesystem or Block mode; cloning is only allowed between volumes with the same VolumeMode setting.
That is about cloning.

So we went through an overview of the snapshot, restore and clone features.

Things to Remember when working with these features:

Here I would like to seek your attention mainly on the items below.

When it comes to data consistency:

There are no snapshot consistency guarantees beyond whatever the storage system provides (e.g. crash consistency). So, at the moment it is up to the storage vendor to provide guarantees on data consistency. Stronger consistency could be provided, but that is the responsibility of higher-level APIs/controllers.

In-flight I/Os:
The API server does not enforce whether the source volume is in use when a user requests a snapshot/clone. Even though "not in use" is mentioned as a prerequisite or recommendation, it exists only in the docs; no code in the controllers stops or prevents the request if the volume is in use.

No in-place restore:

As discussed earlier, reverting an existing volume to an earlier state represented by a snapshot is not supported at present (the beta only supports provisioning a new volume from a snapshot).

The user is free to delete the parent/source volume:

Once the snapshot/clone is created, the user is free to delete the source PVC/volume. This poses some challenges with copy-on-write based snapshots, which many storage backends use. Because a COW snapshot holds a reference linking it to the parent volume, to enable deletion of the parent volume the snapshot must be untangled from it first. That is a bit of work which the CSI driver or other components delegate to the storage system.

E2E WorkFlow

With that, let me move to the next slide, which gives some idea of how a snapshot request works end to end and the components involved in the path.

As you can see in the slides, there are two additional components we have not discussed yet. One is the snapshot controller, which is deployed with the Kube/OCP platform and whose main job is to watch for VolumeSnapshot objects. We also have a sidecar deployed with the CSI bundle, called csi-snapshotter, which watches for VolumeSnapshotContent objects.
The other primitives (VolumeSnapshotClass, VolumeSnapshot and VolumeSnapshotContent) we already discussed in the previous slides.
The VolumeSnapshotClass is available in the cluster, and the user can request a snapshot via a VolumeSnapshot object. This is monitored by the snapshot controller, which creates and populates the VolumeSnapshotContent object, which in turn is monitored by the csi-snapshotter sidecar. Once the csi-snapshotter sidecar container sees a VolumeSnapshotContent object, it talks to the CSI driver (for example, the Ceph CSI driver) through the CSI endpoint, and now it is the turn of the CSI driver to talk to the backend storage cluster and create the snapshot.

Just to repeat, we have two additional pods with controllers:

the snapshot controller and the csi-snapshotter.

I hope this workflow is clear and helps you understand how a request originates from the user and ends with snapshot creation in the backend.
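For reference, from the user's side the whole flow boils down to a few commands; the object and file names follow the earlier examples, and the exact output columns depend on your snapshot CRD version:

[terminal]
kubectl create -f snapshot.yaml
kubectl get volumesnapshot new-snapshot-test
kubectl get volumesnapshotcontent
kubectl describe volumesnapshot new-snapshot-test
[/terminal]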

Please let me know if you got any questions.

References:

Volume Snapshots:
https://kubernetes.io/docs/concepts/storage/volume-snapshots/

Restore from snapshots:
https://kubernetes.io/docs/concepts/storage/persistent-volumes/#volume-snapshot-and-restore-volume-from-snapshot-support

CSI Volume Cloning
https://kubernetes.io/docs/concepts/storage/volume-pvc-datasource/

Persistent Volume and Claim (PV and PVC) status in Kubernetes

Most of the time, users or admins in a Kubernetes or OpenShift cluster are confused about what the persistent volume and persistent volume claim status field means.

Just to make sure we are on same page, this is about the “STATUS” field in the output captured below:

[terminal]
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
datastore-demo-sts-0 Bound pvc-5a541780-dda4-4462-9dde-c143ece76341 1Gi RWO gp2 33d
datastore-demo-sts-1 Bound pvc-686efec4-fb35-4afe-a175-2f2e064b74fb 1Gi RWO gp2 33d
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
postgres-pv-volume 5Gi RWX Retain Bound bsk-n1-dev3/postgres-pv-claim manual 7d22h
pvc-5a541780-dda4-4462-9dde-c143ece76341 1Gi RWO Delete Bound default/datastore-demo-sts-0 gp2 33d
pvc-686efec4-fb35-4afe-a175-2f2e064b74fb 1Gi RWO Delete Bound default/datastore-demo-sts-1 gp2 33d
[/terminal]

Persistent Volume (PV) states:

It can be any of:

“Pending”, “Available”, “Bound”, “Released” or “Failed”

[terminal]
Pending : Used for PersistentVolumes that are not available.

Available: Used for PersistentVolumes that are not yet bound to a PVC.

Bound : Used for PersistentVolumes that are bound with a PVC.

Released : Used for PersistentVolumes where the bound PersistentVolumeClaim was deleted. Released volumes must be recycled before becoming available again. This phase is used by the persistent volume claim binder to signal to another process to reclaim the resource.

Failed : Used for PersistentVolumes that failed to be correctly recycled or deleted after being released from a claim.
[/terminal]

Persistent Volume Claim (PVC) states:

It can be any of:

“Pending”, “Bound”, or “Lost”

[terminal]
Pending: Used for PersistentVolumeClaims that are not yet bound

Bound: Used for PersistentVolumeClaims that are bound

Lost: Used for PersistentVolumeClaims that lost their underlying PersistentVolume. The claim was bound to a PersistentVolume and this volume does not exist any longer and all data on it was lost.
[/terminal]

Ceph CSI: "XFS Superblock has unknown read-only…" or "wrong fs type, bad … on /dev/rbd, missing codepage or …"

We announced the Ceph CSI v2.1.0 release around three weeks back, with many bug fixes and features. More details about this release can be found at https://github.com/ceph/ceph-csi/releases/tag/v2.1.0 and in this blog https://www.humblec.com/one-more-ceph-csi-release-yeah-v2-1-0-is-here/. From the CSI/Rook community interactions we know many folks have updated to this version in the last couple of weeks.

However, unfortunately, the community found a couple of issues with app pod PVC mounting (XFS based) with this CSI version in their Kubernetes/OpenShift clusters.

There are mainly 2 errors/issues encountered in setups:

1) XFS: wrong fs type, bad option, bad superblock on /dev/rbd4, missing codepage or helper program, or other error
2) XFS: Superblock has unknown read-only compatible features (0x4) enabled

As you can see above, both of these issues are about XFS mounting. The Ceph CSI plugin makes use of the `mkfs.xfs` binary when formatting the volume at mount time, and that binary is part of the "xfsprogs" package shipped in the CSI container. To understand this issue better, let's first look at the change:

[terminal]
ceph CSI v2.0.0 release: mkfs.xfs version 4.5.0

ceph CSI v2.1.0 release: mkfs.xfs version 5.0.0
[/terminal]

This change happened when we updated the Ceph base image in our containers from v14.2 to v15, in Ceph CSI version 2.1.0 compared to 2.0.0:

[terminal]
ceph CSI v2.0.0
BASE_IMAGE=ceph/ceph:v14.2

ceph CSI v2.1.0
BASE_IMAGE=ceph/ceph:v15
[/terminal]
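If you want to confirm which xfsprogs version your CSI plugin container ships, you can run `mkfs.xfs -V` inside it; the namespace, pod and container names below are placeholders that depend on your deployment:

[terminal]
kubectl exec -n <namespace> <csi-rbdplugin-pod> -c csi-rbdplugin -- mkfs.xfs -V
[/terminal]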

To give some more context about these issues:

The first one is more generic:

XFS: wrong fs type, bad option, bad superblock on /dev/rbd4, missing codepage or helper program, or other error

That said, this is seen when you have multiple PVCs with the same XFS UUID (filesystem UUID), and the PVC mount fails when it is attached to a pod.

The situation of multiple PVCs with the same UUID arises when you have a snapshotted/cloned volume. Even though Ceph CSI snapshot/clone support is still in `Alpha` state, we have proactively fixed this issue [1].

The second issue:

XFS: Superblock has unknown read-only compatible features (0x4) enabled

pops up when the cluster nodes are running RHEL 7 based kernels (for example, Red Hat kernel 3.10.0-957.el7). The `mkfs.xfs` binary in the new CSI container image is not really compatible with this kernel version, and you will encounter the 'unknown read-only compatible features (0x4)' error at mount time. To avoid this error, the `mkfs.xfs` call would need to include `-m reflink=0`, which disables the incompatible copy-on-write reflink support.
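For illustration, the formatting call with reflink disabled would look roughly like this; the device path is just an example:

[terminal]
mkfs.xfs -m reflink=0 /dev/rbd0
[/terminal]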

More details about these issues can be found at https://github.com/ceph/ceph-csi/issues/966

[1] Solution:

Both of these issues have been addressed in Ceph CSI v2.1.1, which was released today: https://github.com/ceph/ceph-csi/releases/tag/v2.1.1. The immediate fix we made in this new release is reverting to the older `mkfs.xfs` version. The real fix is known and we will make it soon, but until then we don't want to leave our awesome community with broken setups.

We advise you to update to v2.1.1 if you are using `xfs` as your PVC filesystem format in your Kubernetes/OpenShift cluster.

Special thanks to the community users below for reporting this issue and helping us through the various stages of the debugging process!

https://github.com/chandr20
https://github.com/iExalt
https://github.com/fiveclubs
https://github.com/volvicoasis

Happy hacking and talk to us via slack ( http://cephcsi.slack.com) or github https://github.com/ceph/ceph-csi/.

Ceph CSI v1.1.0 Released!!

The Ceph CSI team is excited to have reached a huge milestone with the release of v1.1.0!

https://github.com/ceph/ceph-csi/releases/tag/v1.1.0

Kudos to the Ceph CSI community for all the hard work to reach this critical milestone. This is our first official release (tracked @ https://github.com/ceph/ceph-csi/issues/353), and it came out on 12-Jul-2019. This is a huge release with many improvements in CSI based volume provisioning, making use of the latest Ceph release (Nautilus) for production use with Kubernetes clusters. One of the main highlights of this release is CephFS subvolume based volume provisioning and deletion.

Highlights of this release:

*) CephFS subvolume/manager based volume provisioning and deletion.
*) E2E test support for PVC creation, app pod mounting, etc.
*) CSI spec v1.1 support.
*) Added support for Kubernetes v1.15.
*) Configuration store change from configmap to rados omap.
*) Mount options support for CephFS and RBD volumes.
*) Moved to more granular locking instead of CPU-count based locks.
*) RBD support for ReadWriteMany PVCs in block mode.
*) Unified plugin code for the CephFS and RBD drivers.
*) Driver names updated to the CSI spec standard.
*) Helm chart updates.
*) Sidecar updates to the latest available versions.
*) RBAC corrections and aggregated role addition.
*) Lock protection for create/delete volume and other operations.
*) Added support for the external snapshotter.
*) Added support for the CSIDriver CRD.
*) Support matrix table availability.
*) Many linter fixes and error code fixes.
*) Removal of dead code paths.
*) StripSecretInArgs in pkg/util.
*) Migration from glog to klog.
……….

Many other bug fixes, code improvements and README updates are also part of this release. The container image is tagged "v1.1.0" and can be downloaded with docker pull quay.io/cephcsi/cephcsi:v1.1.0

We have also updated the support matrix for better visibility of the available CSI features and their status upstream.

https://github.com/ceph/ceph-csi#Support-Matrix

We would like to thank the Rook team for unblocking the CSI project at various stages!

We are not stopping here, but moving forward at a good pace toward our next feature-rich release, tracked at https://github.com/ceph/ceph-csi/issues/393. If you would like to see some features or get some bug fixes done in the next release, please help us by mentioning them in that release tracker.

We are also kickstarting an upstream bug triage call next week, so please be part of it. More details about this call are available @ https://github.com/ceph/ceph-csi/issues/463

Happy Hacking!

PS/NOTE: This release needs the latest Ceph Nautilus cluster to support CephFS subvolume provisioning; this version of the cluster is available if you deploy CSI with Rook master.

Ceph CSI driver deployment in a kubernetes cluster

I have recently published a blog on how to deploy a Ceph cluster in a Kube setup. If you don’t have this cluster up and running, please refer to that article. For this attempt we need the below components/software deployed successfully in a setup: Kubernetes, a Ceph cluster and the Ceph CSI driver. The first two deployments ( Kubernetes cluster and …

Read more

Deploy a ceph cluster using Rook (rook.io) in kubernetes

[Updated on 20-Jun-2020: Many changes in Rook Ceph in recent releases, so revisiting this blog article to accommodate the changes, based on a ping in Slack 🙂 ] In this article we will talk about how to deploy a Ceph (software-defined storage) cluster using a Kubernetes operator called ‘rook’. Before we get into …

Read more

Gluster CSI driver 1.0.0 (pre) release is out!!

We are pleased to announce v1.0.0 (pre) release of GlusterFS CSI driver. The release source code can be downloaded from github.com/gluster/gluster-csi-driver/archive/1.0.0-pre.0.tar.gz. Compared to the previous beta version of the driver, this release makes Gluster CSI driver fully compatible with CSI spec v1.0.0 and Kubernetes release 1.13 ( kubernetes.io/blog/2018/12/03/kubernetes-1-13-release-announcement/ ) The CSI driver deployment can be …

Read more

How to reattach a PVC to an existing PV or migrate PVC from one namespace to another in Kubernetes/Openshift cluster

I have been advising many users on various channels (mail, Slack, etc.) on how to accomplish PVC migration, or reattaching an existing PV to a new PVC, for various use cases. The use cases involve scenarios where a user wants to attach a new PVC to an older/existing PV, or wants to migrate a PVC from one namespace to another. However, this hack/workaround has always been outside the support contract; it has simply helped folks who wanted to achieve the end result in some manner, so keep that in mind before you attempt it.

PVC-to-PV binding is always a 1:1 mapping, and at times users want to attach an existing PV to a new PVC, which could be in another namespace.

Let's start. As you know, a bound PV has a `persistentVolumeReclaimPolicy`, which defaults to "Delete". If the PV you want to attach to a new PVC has the "Delete" policy, you need to edit the PV spec and set the reclaim policy to "Retain".

[terminal]
"persistentVolumeReclaimPolicy": "Retain",
[/terminal]
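Instead of hand-editing the spec, the same change can be made with a one-line patch; the PV name is a placeholder:

[terminal]
kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
[/terminal]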
Before you begin this process, let's back up the existing PVC YAML/JSON:

[terminal]
oc get pvc -o yaml > backup_pvc.yaml
[/terminal]

As with any hack on storage, I would recommend backing up the data on the volume mapped to the PV. Data is always critical! So, based on how critical it is, back it up from the storage backend. The storage backend could be anything; in my case it is GlusterFS.

Once the data is backed up, let’s delete the original PVC.

[terminal]
kubectl delete pvc pvcname
[/terminal]
When you delete the PVC, the PV state should follow along; it should soon transition to the `Released` state. Wait for the PV status to show "Released", and once it does, edit the PV and delete the `claimRef` field from the PV spec/definition, which refers to the now-deleted PVC.
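If you prefer patching over interactive editing, the `claimRef` field can also be removed with a JSON patch; again, the PV name is a placeholder:

[terminal]
kubectl patch pv <pv-name> --type json -p '[{"op": "remove", "path": "/spec/claimRef"}]'
[/terminal]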

Extract and keep the PV name for later; we need it for the new PVC. Once we have it, create a new PVC in the desired namespace with the `volumeName` field set to the old PV name.

For example:

[terminal]
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: newclaim
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: glusterfs
  volumeMode: Filesystem
  volumeName: pvc-3466cff4-g4gb-12e9-962b-54009bg11116
[/terminal]

The new PVC should bind to the existing PV, and the volume should become available in the new namespace!

CRDs, Operators/Controllers, Operator-SDK – Write your own controller for your Kubernetes cluster – [Part 1]

You are here, so there are two possibilities: either you already know about the terms below, or you want to know more about them. In any case, I have to touch upon these terms before we proceed further. Custom Resource Definitions ( CRD) Custom Resources ( CR) Operators/Controllers Operator SDK Custom Resource Definitions/CRDs: In the …

Read more