Benchmarks

In this evaluation we've benchmarked Nakama on a number of workloads that test several of the core APIs you'd typically use to build a large-scale multiplayer game.

The workloads have been run on a set of different hardware configurations to demonstrate the performance advantages that come from Nakama's modern hardware-friendly and highly-scalable architecture.

As the results demonstrate, Nakama's performance grows as the hardware size grows. Scaling both up and out offers many advantages, from simplified cluster management to access to generally better hardware and economies of scale.

Methodology

The benchmarks were performed using Tsung, a powerful, distributed load testing tool.

The Tsung workloads benchmark Nakama in single-node deployment (Nakama OSS) and clustered mode (Nakama Enterprise) in a few different configurations, using a single database instance.

The database instance hardware was kept constant throughout all configurations and workloads to ensure there were no I/O bottlenecks. Although we've also tested some database-bound APIs, these benchmarks focus on the capabilities of Nakama itself.

The Tsung servers are run on Google Compute Engine (GCE). Both Nakama OSS and Enterprise have been run on our Heroic Cloud infrastructure.

No warmup runs were executed before the actual workloads.

Setup

Tsung / Database

The Tsung topology consists of one master and twenty slave nodes. This setup was unchanged across all the benchmark runs and the hardware specification was:

                    Tsung Master     Tsung Slaves     Database
Instance Type       n1-standard-32   n1-standard-32   dedicated-core vCPU
vCPU / Mem          6 / 8GB          3 / 2GB          8 / 30GB
IOPS (read/write)   ---              ---              3000

The database was set up on Google CloudSQL.

Nakama

We've run the benchmark workloads against three configurations:

Nakama OSS

  • 1 Node - 1 CPU / 3GB RAM

Nakama Enterprise

  • 2 Nodes - 1 CPU / 3GB RAM (per node)
  • 2 Nodes - 2 CPU / 6GB RAM (per node)

All the containers were run on the GCP instance type "n1-standard-32" and were created on the Heroic Cloud platform. The Nakama nodes sit behind a GCP L7 load balancer.

Workloads

These workloads are meant to demonstrate Nakama's throughput and its capacity for effortless production-ready scale.

We'll present the benchmarking results for the following workloads:

  1. Number of concurrent socket connections (CCU count).
  2. Throughput of new user registration.
  3. Throughput of user authentication.
  4. Throughput of custom RPC call in the Lua runtime.
  5. Throughput of custom RPC call in the Go runtime.
  6. Number of authoritative realtime matches using custom match handlers.

The following subsections are dedicated to each of the aforementioned workloads. Each workload is described in more detail, followed by the benchmark results gathered by Tsung for each of the considered hardware and topology configurations.

Results

Workload 1 - Number of concurrent socket connections (CCU count)

This workload consists of authenticating a user, opening a socket connection to Nakama, and keeping it open for around 200 seconds.

[Charts: number of connected users over time for the 1 Node - 1 CPU / 3GB RAM, 2 Nodes - 1 CPU / 3GB RAM (per node), and 2 Nodes - 2 CPU / 6GB RAM (per node) configurations]
Time to connect

Hardware                           Max Connected   Highest 10s mean   Lowest 10s mean   Highest Rate   Mean Rate      Mean
1 Node - 1 CPU / 3GB RAM           19542           42.40 msec         26.54 msec        1340 / sec     196.12 / sec   34.21 msec
2 Nodes - 1 CPU / 3GB RAM (each)   27558           16.93 msec         15.92 msec        930 / sec      161.52 / sec   16.60 msec
2 Nodes - 2 CPU / 6GB RAM (each)   32092           20.17 msec         18.18 msec        1097.7 / sec   187.82 / sec   19.15 msec

Summary

A single Nakama instance with a single CPU core can sustain up to ~19,500 connected users. Scaled up to 2 nodes with 2 CPU cores each, this value rises to ~32,000 CCU.

Workload 2 - Register a new user

This workload emulates the registration of new users through the game server's device authentication API which stores the new accounts to the database.

[Charts: throughput (req/s) for the 1 Node - 1 CPU / 3GB RAM, 2 Nodes - 1 CPU / 3GB RAM (per node), and 2 Nodes - 2 CPU / 6GB RAM (per node) configurations]
Request statistics

Hardware                           Highest 10s mean   Lowest 10s mean   Highest Rate    Mean Rate      Mean
1 Node - 1 CPU / 3GB RAM           29.07 msec         20.10 msec        849.6 / sec     519.46 / sec   24.60 msec
2 Nodes - 1 CPU / 3GB RAM (each)   31.65 msec         20.01 msec        1014.3 / sec    672.18 / sec   25.95 msec
2 Nodes - 2 CPU / 6GB RAM (each)   0.14 sec           20.01 msec        1160.8 / sec    750.76 / sec   28.46 msec

Summary

A single Nakama server can handle average loads of ~500 requests/sec with requests served in 24.60 msec (mean), including a database write for each new user. At this rate a game can create ~1.87 million new players every hour. This value goes up to 2.7 million player accounts per hour when scaled to 2 nodes.

Workload 3 - Authenticate a user

This workload consists of authenticating an existing user using the game server's device authentication API.

[Charts: throughput (req/s) for the 1 Node - 1 CPU / 3GB RAM, 2 Nodes - 1 CPU / 3GB RAM (per node), and 2 Nodes - 2 CPU / 6GB RAM (per node) configurations]
Request statistics

Hardware                           Highest 10s mean   Lowest 10s mean   Highest Rate    Mean Rate      Mean
1 Node - 1 CPU / 3GB RAM           33.40 msec         17.27 msec        802.2 / sec     499.63 / sec   24.61 msec
2 Nodes - 1 CPU / 3GB RAM (each)   27.87 msec         16.81 msec        1035.5 / sec    673.42 / sec   22.19 msec
2 Nodes - 2 CPU / 6GB RAM (each)   76.85 msec         16.95 msec        1162 / sec      776.77 / sec   25.03 msec

Workload 4 - Custom Lua RPC call

This workload executes a simple RPC function exposed through the Lua runtime. The function receives a payload as a JSON string, decodes it, and echoes it back to the sender.

[Charts: throughput (req/s) for the 1 Node - 1 CPU / 3GB RAM, 2 Nodes - 1 CPU / 3GB RAM (per node), and 2 Nodes - 2 CPU / 6GB RAM (per node) configurations]
Request statistics

Hardware                           Highest 10s mean   Lowest 10s mean   Highest Rate    Mean Rate      Mean
1 Node - 1 CPU / 3GB RAM           26.18 msec         15.01 msec        976.5 / sec     633.42 / sec   20.22 msec
2 Nodes - 1 CPU / 3GB RAM (each)   19.25 msec         15.68 msec        1192 / sec      706.71 / sec   17.48 msec
2 Nodes - 2 CPU / 6GB RAM (each)   20.27 msec         16.11 msec        1383.4 / sec    823.55 / sec   18.16 msec

Workload 5 - Custom Go RPC call

This workload executes a simple RPC function exposed through the Go runtime. The function receives a payload as a JSON string, decodes it, and echoes it back to the sender.

[Charts: throughput (req/s) for the 1 Node - 1 CPU / 3GB RAM, 2 Nodes - 1 CPU / 3GB RAM (per node), and 2 Nodes - 2 CPU / 6GB RAM (per node) configurations]
Request statistics

Hardware                           Highest 10s mean   Lowest 10s mean   Highest Rate    Mean Rate      Mean
1 Node - 1 CPU / 3GB RAM           26.12 msec         14.42 msec        975.9 / sec     635.36 / sec   19.97 msec
2 Nodes - 1 CPU / 3GB RAM (each)   20.87 msec         14.91 msec        1205.7 / sec    707.12 / sec   17.29 msec
2 Nodes - 2 CPU / 6GB RAM (each)   20.19 msec         15.59 msec        1386.4 / sec    820.41 / sec   18.00 msec

Summary

A single Nakama server can handle an average of ~600 requests/sec served in 19.97 msec (mean). Comparing these results with those of Workload 4, we see that the Lua and Go runtimes perform very similarly. This is because the benchmarked workload does not involve significant CPU computation, so the results remain close despite the overhead of the Lua virtual machine. With CPU-intensive code the performance results would start to diverge, as would the Lua runtime's RAM usage.

Workload 6 - Custom authoritative match Logic

This workload emulates a realtime multiplayer game running on Nakama's server-authoritative multiplayer engine. Although the client and custom logic are not an actual multiplayer game, the code approximates a real use-case scenario in terms of the messages exchanged between the server and the connected game clients. We'll briefly explain the server and client logic in this workload.

Server side logic

The server runs multiplayer matches with a tick rate of 10 ticks per second. Each match can have a maximum of 10 players.

The server implements an RPC call that the client can query to get the ID of an ongoing match (with fewer than 10 players). When this API is invoked, the server uses the Match Listing feature to look for matches that are not full and returns the first result. If no match is found, a new one is created.

The match loop logic is simple: the server expects to receive one of two opcodes from the client and performs the corresponding action:

  1. Echo back the received message to the client.
  2. Broadcast the message to all of the match participants.

Client side logic

The client logic is also simple; each game client performs the following steps in order:

  1. Authenticates an existing user with Nakama to receive a token.
  2. Executes the server RPC function to receive the ID of an ongoing match (which is not full).
  3. Establishes a websocket connection with the realtime API.
  4. Joins the match with the ID received in step 2.
  5. Loops for 180 seconds, alternating every half second between sending a message with opcode 1 or 2.

The messages sent by the client contain fixed-size payloads: strings of 44 and 35 characters for opcodes 1 and 2 respectively.

[Charts: number of connected users over time for the 1 Node - 1 CPU / 3GB RAM, 2 Nodes - 1 CPU / 3GB RAM (per node), and 2 Nodes - 2 CPU / 6GB RAM (per node) configurations]

These results are the averages across every request made by the client, because this workload involved:

  1. Authentication
  2. An RPC call
  3. Connecting to the websocket
  4. Sending messages through the websocket connection

The results therefore take into account the entire set of request logic performed within each client session.

Request statistics

Hardware                           Highest 10s mean   Lowest 10s mean   Highest Rate    Mean Rate     Mean
1 Node - 1 CPU / 3GB RAM           42.21 msec         1.07 msec         126.5 / sec     36.72 / sec   15.06 msec
2 Nodes - 1 CPU / 3GB RAM (each)   0.10 sec           1.14 msec         213.8 / sec     54.68 / sec   28.21 msec
2 Nodes - 2 CPU / 6GB RAM (each)   41.82 msec         1.07 msec         350 / sec       85.82 / sec   15.93 msec

The table below shows the network throughput handled by the game server for the data messages exchanged within the matches. The number of bytes received by the clients is much higher than the number of bytes sent: as noted above, 50% of the messages sent by clients trigger a server broadcast to all match participants.

Network Throughput

Hardware                           Direction   Highest Rate      Total
1 Node - 1 CPU / 3GB RAM           Sent        4.65 Mbits/sec    157.92 MB
                                   Received    24.88 Mbits/sec   809.65 MB
2 Nodes - 1 CPU / 3GB RAM (each)   Sent        5.90 Mbits/sec    201.35 MB
                                   Received    31.96 Mbits/sec   1020.68 MB
2 Nodes - 2 CPU / 6GB RAM (each)   Sent        7.64 Mbits/sec    261.61 MB
                                   Received    40.54 Mbits/sec   1.30 GB