Karan Singh

Code Never Lies, Comments Sometime Do !!

Don't Underestimate the Power of Ceph Placement Groups


Today I would like to share findings from some testing I did out of curiosity, prompted by this basic question.

How does Placement Group count affect Ceph performance?

If you are reading this blog, I assume you know what Ceph is and how Ceph Placement Groups (PGs) work.

It all started when my colleague Sir Kyle Bader and I were discussing how to get more performance out of our Ceph cluster, which had the following environment details:

  • 6 x OSD Nodes
    • 35 x HDD SATA 6TB
    • 2 x Intel P3700 PCIe
    • 40GbE cluster / public network
    • Ceph Jewel (10.2.0)
    • 3x Replication
  • 16 x Client Nodes
    • 10GbE
    • Application: CBT
    • Workload: Ceph Radosbench

We used CBT to trigger radosbench across 16 client nodes to exercise the Ceph cluster. The performance numbers we got from this test were hard to digest. We were sure Ceph could deliver more, but we needed to find out why it was not doing so. After some drama, the cause turned out to be the CBT profile, which created 1 pool with 4096 PGs. We bumped the PG count to a higher number, re-ran the tests, and boom, performance was as expected.

Out of curiosity, I thought it would be nice to have some data showing Ceph performance at different Placement Group counts. So I performed a few more CBT runs at different PG counts and came up with this.

[Graph: bandwidth at different PG counts]

The graph above compares different PG counts against bandwidth. At first glance you might not see the effect of the PG count. To highlight it, we normalized the PG counts to a single OSD, which makes the situation a bit clearer.

[Graphs: bandwidth against PGs per OSD]

Since we are running 3x replication, the number of PG copies stored across the cluster for a pool is 3 times its PG count. For instance, a 3x replicated pool with 4096 PGs actually has 4096 x 3 = 12288 PG copies distributed across the cluster, and if we divide this number by the total number of OSDs (210 in our case) we get ~58.5 PGs per OSD. (We have other pools with some PGs, which is why the graph shows 61 instead of 58, but we can ignore this for now.)
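The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the normalization, not something CBT or Ceph ships; the function name `pg_copies_per_osd` is made up for this example, and the inputs are the numbers from our environment.

```python
def pg_copies_per_osd(pg_count: int, replication: int, osd_count: int) -> float:
    """Average number of PG copies each OSD holds for a single pool.

    Hypothetical helper for illustration: it assumes PG copies are spread
    evenly across all OSDs, which CRUSH only approximates in practice.
    """
    return pg_count * replication / osd_count


# Numbers from the environment described in this post.
osd_count = 6 * 35      # 6 OSD nodes with 35 HDDs each = 210 OSDs
pg_count = 4096         # PG count of the pool created by the CBT profile
replication = 3         # 3x replication

total_copies = pg_count * replication
per_osd = pg_copies_per_osd(pg_count, replication, osd_count)
print(f"{total_copies} PG copies across {osd_count} OSDs = {per_osd:.1f} per OSD")
```

Other pools on the same cluster add their own PG copies on top of this figure, which is why the measured value on a real cluster can be a little higher.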

As you can see from the graphs, when you increase the PG count, the performance of the same Ceph cluster increases substantially. The reason is that with a denser PG distribution you are squeezing your media harder to get performance out of it.

Now, gaining performance by increasing the PG count is not always the best thing to do. It comes with several trade-offs:

  • More PGs lead to larger cluster maps, which can put more load on the MONs
  • Higher memory usage (needs to be tested)
  • Tough recovery scenarios (needs to be tested)
  • Ceph throws warning messages if the PG count per OSD exceeds 200

With this blog I am not recommending that you bump up your production Ceph cluster’s PG count to gain some performance. Instead, it gives hope that “yes, performance can be improved this way”.

Don’t forget to verify your cluster’s pools: they should have enough PGs, and PG Calc is your friend.
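If you want a rough starting point before opening PG Calc, here is a minimal sketch of the commonly quoted rule of thumb: (OSDs x target PGs per OSD) / replica count, rounded up to the next power of two. It is not the PG Calc tool itself. The function name `suggest_pg_count` is hypothetical, the default target of 100 PGs per OSD is an assumption, and the sketch assumes a single pool holding all of the data.

```python
import math


def suggest_pg_count(osd_count: int, replication: int, target_per_osd: int = 100) -> int:
    """Rule-of-thumb PG count for one pool that holds all of the data.

    Hypothetical helper: target_per_osd=100 is an assumed target, and the
    result is rounded up to the next power of two.
    """
    raw = osd_count * target_per_osd / replication
    needed = max(1, math.ceil(raw))
    # Smallest power of two that is >= needed.
    return 1 << (needed - 1).bit_length()


# Our cluster: 210 OSDs, 3x replication.
suggested = suggest_pg_count(osd_count=210, replication=3)
print(f"Suggested PG count: {suggested}")
print(f"Resulting PG copies per OSD: {suggested * 3 / 210:.1f}")
```

With several pools, the per-OSD target has to be shared between them according to how much data each pool is expected to hold, which is exactly the bookkeeping PG Calc does for you.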