March 2023

The notes of the month.

Practical Optimizations by Jason Booth

In his video, as an experienced game developer, he talks about data-oriented degign and use cases in some of the projects he took part it. However, not everything was applicaple, as i see Unity packages are unmaintaned for .NET. Some of the things I noted was

  1. Keep data together. Meaning instead of having thousands of objects, spreaded to everywhere, keep them in manager class. In practice, it looks like this.

// Instead of this
public class FootballPlayer 
{
	public decimal Speed { get; set; }
	public decimal Hardworking { get; set; }
	public byte Number { get; init; }
	// And other properties
	
	public void TrainSpeed() 
	{
		Speed += Hardworking * 0.02;	
	}
}

// You can use this
public class FootballPlayerManager 
{
	public decimal[] Speeds { get; set; }
	public decimal[] Hardworkings { get; set; }
	public byte[] Numbers { get; init; }
	
	public void TrainSpeed()
	{
		for(var i = 0; i<Speeds.Count; i++) 
		{
			Speeds[i] += Hardworkings[i] * 0.02;
		}
	}
}

In this way, we can update the data much faster. We can even use Simd if we ever need. Or just like Parquet, we can have a zone map.

  1. List of structs over list of classes. As the data stays side by side, especially looping on it will be much faster, you’ll be able to utilize caches much better.

  2. Avoid virtual calls in loops. Same for polymorphism.

How Discord Stores Trillions of Messages

About Discord’s recent blog.

  1. Keep the related data together in the database.

It seems like Discord was keeping data together based on channel and bucket. Bucket is about some static time window. If you’re reading from a public channel, you’ll most likely to read last messages, and messages that were sent in close times. So keeping the related data together, especially based on time in this case, would reduce the number of I/Os a lot. How? The blog is about Cassandra but same ideas are applicaple for relational databases which saves the data in pages in the disk. In the pages, there are tuples(rows) side by side. When you’re reading data it, you take the page and put it to buffer pool(in the memory). Let’s visualize that.[1]

Image

So when you read 10 unrelated data from the database, you may copy 10 different pages to the buffer pool. But if they were kept together, you may need only a single page read. We have options such as, UUID, clustered index and partitioning table. You also need to consider if a partition has very few data, but others have a lot.

  1. LSM based databases have faster writes but slower reads. They’re just writing everything, read, update, insert, delete. Later while compacting they remove unnecessary ones. According to blog, the problems arised due to hotspots because the database doesn’t a cache like buffer pool, constantly going to the disk. So they solved their problems by implementing a data service(cache) between services and the database.

Faster than Rust and C++: the PERFECT hash table by strager

As the title suggest, in the video, he incrementally optimizes hash table. There are a lot of tips, but the most important lesson was the knowing your data.

Sources

[1]: 03 - Database Storage 1 (CMU Intro to Database Systems / Fall 2022)