This post covers the design approach for System Design - Tiny URL.

Strategy template

Summary after studying

Question 1: How many unique identifiers are possible? Will you run out of unique URLs?

62^6 ≈ 57 billion, so we will not run out any time soon.
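The 62 characters are [0-9a-zA-Z]. As a quick illustration (not from the original), here is the capacity check and a minimal base62 encoder in Python:

```python
import string

# 62 characters: digits + lowercase + uppercase letters
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(n: int) -> str:
    """Encode a non-negative integer as a base62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

# With 6 characters we can address 62^6 distinct identifiers.
print(62 ** 6)             # 56800235584 -- about 57 billion
print(base62_encode(125))  # "21" (125 = 2*62 + 1)
```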

Question 2: Should the identifier be incrementing or not? Which is easier to design? Pros and cons?

No. An auto-increment ID is easier to design, but it becomes a problem as soon as we need more than one MySQL machine, because they would all have to agree on a single counter.

Issues:

  • Running out of cache
  • More and more write requests
  • More and more cache misses

SOLUTIONS:

Database Partitioning

1. Vertical sharding 2. Horizontal sharding.

Horizontal sharding is the better choice here.

The current table structure is (id, long_url). So which column should be the sharding key?

An easy way is to shard by id modulo the number of machines.

Here comes another question: How could multiple machines share a global auto_increment_id?

Two ways: 1. use a dedicated machine to maintain the id; 2. use ZooKeeper. Both are poor options: the dedicated machine is a single point of failure, and ZooKeeper adds latency and operational complexity.

So, we do not use global auto_increment_id.

Solution 1: put the sharding key in the first byte (character) of the short_url.

Solution 2: use consistent hashing to split the ring into 62 pieces. The exact number doesn't matter much, since there probably won't be more than 62 machines (it could be 360 or whatever). Each machine is responsible for its section of the ring.

Write: longURL -> hash(longURL) % 62 -> put the longURL on the corresponding machine according to the hash value -> generate a shortURL based on this value -> return the shortURL

Read: shortURL -> get the sharding key (the first character of the shortURL) -> query the corresponding machine -> return the longURL.
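The write/read flow above can be sketched as follows. The dicts, counter fields, and function names are stand-ins I made up for the real machines and databases; each shard keeps its own local counter, so no global auto_increment_id is needed:

```python
import hashlib
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 chars
NUM_SHARDS = 62

# Each "machine" keeps a local auto-increment counter and its own table.
# Plain dicts stand in for the real databases.
shards = [{"next_id": 0, "rows": {}} for _ in range(NUM_SHARDS)]

def shard_of(long_url: str) -> int:
    # Stable hash so the same longURL always lands on the same machine.
    return int(hashlib.md5(long_url.encode()).hexdigest(), 16) % NUM_SHARDS

def base62_encode(n: int) -> str:
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def base62_decode(s: str) -> int:
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

def write(long_url: str) -> str:
    """longURL -> pick shard by hash -> local id -> shortURL."""
    s = shard_of(long_url)
    local_id = shards[s]["next_id"]
    shards[s]["next_id"] += 1              # per-machine counter, no coordination
    shards[s]["rows"][local_id] = long_url
    return ALPHABET[s] + base62_encode(local_id)  # first char = sharding key

def read(short_key: str) -> str:
    """shortURL -> sharding key (first char) -> look up on that machine."""
    s = ALPHABET.index(short_key[0])
    return shards[s]["rows"][base62_decode(short_key[1:])]
```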

Each time we add a new machine, move half of the range of the busiest machine to the new machine.

Question 3: Mapping an identifier to a URL and back - does this problem ring a bell?

Optimization:

  • Use geographic information as the sharding key, e.g. 0 for the US, 1 for China. Put the US DB in the US and the UK DB in Europe.

  • Different regions use their own web servers and cache servers. All regions share one DB that is used to route users to the closest web server (via DNS) when they get a cache miss.

Question 4: How do you store the URLs? Does a simple flat-file database work?

QPS is not high, so we can use SQL.

Question 5: What is the bottleneck of the system? Is it read-heavy or write-heavy?

It is read-heavy. We can use an in-memory database as a cache layer to improve response speed.
See Question 2 and Question 3.

Question 6: Estimate the maximum number of URLs a single machine can store.

Let's assume that a single machine has 100 GB of storage.
Average size of a longURL = 100 B
Average size of a shortURL identifier = 6 B (6 characters)
An ID (int) = 4 B
So each URL row is about 110 B, and the number of URLs is 100 GB / 110 B ≈ 0.9 billion.
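Redoing the division as a sanity check (100 GB / 110 B comes out near a billion rather than 100 million):

```python
storage_bytes = 100 * 10**9   # 100 GB per machine
row_bytes = 100 + 6 + 4       # longURL + 6-char shortURL + int id

max_urls = storage_bytes // row_bytes
print(max_urls)               # 909090909 -- roughly 0.9 billion URLs
```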

Question 7: Estimate the maximum number of queries per second (QPS) for decoding a shortened URL on a single machine.

Let’s assume that:

  • Daily users: 1,000,000
  • Daily usage per person: (write) encode: 0.1, (read) decode: 1
  • Daily requests: write 0.1M, read 1M
  • QPS: write 0.1M / 86,400 s ≈ 1.2, read 1M / 86,400 s ≈ 12
  • Peak QPS (assuming 5× average): write 1.2 × 5 = 6, read 12 × 5 = 60
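Spelled out in code (the 5× peak factor is a rule-of-thumb assumption, not a measured value):

```python
daily_users = 1_000_000
daily_writes = daily_users * 0.1   # encode requests per day
daily_reads = daily_users * 1      # decode requests per day
seconds_per_day = 24 * 60 * 60     # 86,400

write_qps = daily_writes / seconds_per_day   # ~1.2
read_qps = daily_reads / seconds_per_day     # ~11.6, rounded to 12 above
peak = 5                                     # assumed peak-to-average ratio
print(write_qps * peak, read_qps * peak)     # roughly 6 and 58
```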

See Question 5.

Question 9: How could you handle redundancy? I.e., if a server is down, how could you ensure the service is still operational?

Question 10: Keep URLs forever or prune them? Pros and cons? How do we do the pruning?

Daily new URLs: 0.1M, so it takes about 0.9B / 0.1M ≈ 9,000 days (roughly 25 years) to fill up the storage. Keeping URLs forever is clearly preferable, and storage is cheap anyway. If we ever run out of storage, we can delete inactive URLs.

Solution 1: prune URLs whose last-accessed timestamp is more than 300 days old. With an index on that timestamp, this query should be quick even on a table with hundreds of millions of rows.

Solution 2: use an in-memory database such as Redis or Memcached to store the URLs; inactive URLs get pruned automatically by the LRU eviction mechanism. But this is unrealistic in the real world because memory is much more expensive than disk storage.

Question 11: What API would you provide to a third-party developer?

An API for the URL service: get URL stats such as total hits, referrers, and last accessed time.
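A possible shape for such a stats API, sketched in Python. The class, method, and field names here are hypothetical, not an actual TinyURL API:

```python
import time
from dataclasses import dataclass, field

@dataclass
class UrlStats:
    """Per-URL counters a third party might query (hypothetical fields)."""
    total_hits: int = 0
    referrers: dict = field(default_factory=dict)
    last_accessed: float = 0.0

class UrlServiceApi:
    """Sketch of the read-only stats surface exposed to third parties."""

    def __init__(self):
        self._stats = {}

    def record_hit(self, short_key: str, referrer: str) -> None:
        # Called internally on every redirect.
        s = self._stats.setdefault(short_key, UrlStats())
        s.total_hits += 1
        s.referrers[referrer] = s.referrers.get(referrer, 0) + 1
        s.last_accessed = time.time()

    def get_stats(self, short_key: str) -> dict:
        # A GET /urls/<short_key>/stats endpoint would return this payload.
        s = self._stats.get(short_key, UrlStats())
        return {"total_hits": s.total_hits,
                "referrers": dict(s.referrers),
                "last_accessed": s.last_accessed}
```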

Question 12: If you can enable caching, what would you cache and what's the expiry time?

We can use an in-memory database such as Redis or Memcached as a cache layer. By the 80/20 rule, roughly 80% of the total hits come from 20% of the URLs. The expiry time can be short (e.g. 10 minutes); the right value depends on the QPS.
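A toy sketch of that cache layer, combining LRU eviction with a short TTL. In practice this would just be Redis or Memcached with an expiry set; the class below is purely illustrative:

```python
import time
from collections import OrderedDict

class TtlLruCache:
    """Tiny cache sketch: LRU eviction plus a short expiry (e.g. 10 minutes)."""

    def __init__(self, capacity: int, ttl_seconds: float):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self._data = OrderedDict()   # key -> (value, expires_at)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if time.monotonic() > expires_at:
            del self._data[key]      # expired entry: treat as a miss
            return None
        self._data.move_to_end(key)  # mark as recently used
        return value

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = (value, time.monotonic() + self.ttl)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict least recently used
```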

Thought 1: when to use NoSQL and when to use SQL

  • Do you need transactions? If so, use SQL.
  • Do you need complex queries? If so, use SQL.
  • Do you need AUTO_INCREMENT IDs? If so, use SQL, because NoSQL generally only provides a globally unique object id.
  • Do you need very high QPS? If so, use NoSQL: Redis can reach millions of QPS, MongoDB is around the 10K level, and MySQL only supports thousands.
  • Do you need high scalability? NoSQL has it built in, while MySQL needs extra code to scale.

Thought 2: why Redis and Memcached are popular, and where they are used

Redis and Memcached are the standout examples among NoSQL databases; both are in-memory key-value stores. One reason they are so popular now is that MySQL's open-source nature fueled the "de-IOE" trend (dropping IBM/Oracle/EMC stacks to cut costs) at large companies, but MySQL tops out at roughly thousands of QPS. Under high concurrency, sudden traffic spikes and similar situations are handled well by a distributed cache service: a single Redis or Memcached instance can reach 100K QPS or more.

Difference: Redis is essentially an enhanced Memcached. For simple structures, storing strings smaller than 1 MB, Memcached is slightly more efficient; in all other cases, prefer Redis.

Mainly because:

  1. Memcached limits keys to 250 B and values to 1 MB, while Redis supports up to 512 MB for both keys and values.
  2. Memcached only stores strings and is mainly used for read-heavy data, while Redis supports strings, hashes, lists, sets, and other types and operations.
  3. Memcached has no data persistence: after a restart, all data is gone. Redis provides master-slave replication, so the service can keep running even if a server crashes.
  4. Memcached's eviction mechanism uses the LRU algorithm; Redis's is more sophisticated, offering 6 different policies for fine-grained control over expiring cached data.
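For reference, Redis selects its eviction policy via `maxmemory-policy` in redis.conf; the six classic policies are listed below (newer Redis versions add LFU variants):

```
maxmemory 100mb
# one of: noeviction, allkeys-lru, volatile-lru,
#         allkeys-random, volatile-random, volatile-ttl
maxmemory-policy allkeys-lru
```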