Data Center Networking

Data center applications are becoming more and more important to companies such as Microsoft, Google, and Amazon. In this scenario, do we still need the TCP/IP network stack? TCP/IP targets complex environments, so it handles routing and failure handling. Much of that may be unnecessary inside a data center, so maybe it is time for us to redesign the network stack for the data center.

Here are some of my initial ideas about this topic.

We partition the machines in the data center into several groups. Within each group, the machines are all directly connected, so we don’t need to maintain connections or a resend mechanism.



Dynamo: Amazon’s Highly Available Key-Value Store

System Assumptions and Requirements

  • Query Model: simple read and write operations to a data item that is uniquely identified by a key.
  • Dynamo targets applications that operate with weaker consistency if this results in high availability. Dynamo does not provide any isolation guarantees and permits only single-key updates.
  • Efficiency: the system needs to function on commodity hardware infrastructure. Services have stringent latency requirements, which are in general measured at the 99.9th percentile of the distribution (for example, a service provides a response within 300 ms for 99.9% of its requests at a peak client load of 500 requests per second).
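To see why the requirement is stated at the 99.9th percentile rather than as an average, here is a small sketch. The nearest-rank percentile method, the function names, and the sample latencies are my own assumptions for illustration; they do not come from the paper.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100.0 * len(ordered))
    rank = min(max(rank, 1), len(ordered))  # clamp against rounding at the edges
    return ordered[rank - 1]


def meets_sla(latencies_ms, p=99.9, bound_ms=300.0):
    """True if the p-th percentile latency is within the bound."""
    return percentile(latencies_ms, p) <= bound_ms


# Hypothetical measurements: 9,980 fast requests and 20 slow ones.
latencies = [20.0] * 9980 + [450.0] * 20
mean_ms = sum(latencies) / len(latencies)

print("mean latency:", mean_ms)
print("99.9th percentile:", percentile(latencies, 99.9))
print("meets 300 ms SLA:", meets_sla(latencies))
```

In this made-up data the mean looks healthy, but the slow tail violates the 300 ms bound, which is exactly what a percentile-based requirement is meant to catch.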

Design Considerations

  • Weak consistency (eventually consistent)
  • Application resolves conflicts (always writable)
  • Incremental scalability
  • Symmetry
  • Decentralization
  • Heterogeneity

System Interface:

  • get(key) : return a single object or a list of objects with conflicting versions along with a context
  • put(key, context, object)
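A toy sketch of how this interface could behave, assuming the context is a vector clock. The `Replica` class, the `sync_from` helper, and the shopping-cart data are hypothetical names of mine, and the real system is far more involved (replication, partitioning, failure handling).

```python
def descends(a, b):
    """True if vector clock a is equal to or newer than b on every node."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())


class Replica:
    """Hypothetical in-memory replica; not Dynamo's actual implementation."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0      # number of writes coordinated by this node
        self.versions = {}    # key -> list of (vector clock, object)

    def get(self, key):
        """Return all conflicting versions plus a context that covers them."""
        versions = self.versions.get(key, [])
        context = {}
        for clock, _ in versions:
            for node, counter in clock.items():
                context[node] = max(context.get(node, 0), counter)
        return [obj for _, obj in versions], context

    def put(self, key, context, obj):
        """Write a version that supersedes everything covered by context."""
        self.counter += 1
        clock = dict(context)
        clock[self.node_id] = self.counter
        self._store(key, clock, obj)

    def _store(self, key, clock, obj):
        current = self.versions.get(key, [])
        if any(descends(old, clock) for old, _ in current):
            return  # we already hold this version or a newer one
        kept = [(c, o) for c, o in current if not descends(clock, c)]
        self.versions[key] = kept + [(clock, obj)]

    def sync_from(self, other):
        """Copy versions from another replica (stand-in for replication)."""
        for key, versions in other.versions.items():
            for clock, obj in versions:
                self._store(key, dict(clock), obj)


# Two replicas accept writes for the same key independently.
a, b = Replica("A"), Replica("B")
a.put("cart", {}, ["book"])
b.put("cart", {}, ["pen"])
a.sync_from(b)

# get() now returns both conflicting versions and a context.
conflicting, ctx = a.get("cart")

# The application merges them and writes back using that context.
merged = sorted(item for version in conflicting for item in version)
a.put("cart", ctx, merged)
resolved, _ = a.get("cart")
```

The point of the context is visible in the last step: the put that carries it replaces both conflicting versions, so the application, not the store, decides how the conflict is resolved.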

Experiences and lessons learned

  • The main advantage of Dynamo is that its client applications can tune the values of N, R and W to achieve their desired levels of performance, availability and durability.
  • Each node keeps an object buffer in its main memory. Each write operation is stored in the buffer and is periodically written to storage by a writer thread.
  • 99.94% of requests saw exactly one version; 0.00057% of requests saw 2 versions; 0.00047% of requests saw 3 versions; and 0.00009% of requests saw 4 versions (amazing).
  • Client-driven coordination is better than server-driven coordination.
  • Balancing background vs. foreground tasks.
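To make the N, R, W tuning concrete: with N replicas, a read that waits for R replies and a write that waits for W acknowledgements are guaranteed to share at least one replica only when R + W > N. The sketch below checks that rule by brute force for small N; the function names are mine.

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    """The usual rule of thumb: read and write sets must overlap."""
    return r + w > n


def always_intersect(n, r, w):
    """Brute force: does every set of r replicas share at least one
    member with every set of w replicas?"""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )


for n, r, w in [(3, 2, 2), (3, 1, 1), (3, 1, 3)]:
    print((n, r, w), quorums_overlap(n, r, w), always_intersect(n, r, w))
```

For example, (3, 2, 2) gives overlapping quorums, while (3, 1, 1) gives faster reads and writes at the risk of reading stale data. That is the trade-off the applications are tuning.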


The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software

http://www.gotw.ca/publications/concurrency-ddj.htm

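Two issues must be considered in parallel programming: the data being processed and the communication between tasks. After selection by users and elimination by the market, parallel programming standards have basically converged on the following three approaches:

1. Data parallelism. The data processed by each task is separate, and tasks communicate by message passing; data partitioning and message passing are handled by the compiler.

HPF (High Performance Fortran) is the typical data-parallel programming language. Because current compiler technology still handles the irregular problems found in real applications poorly, and because HPF focuses solely on data parallelism, HPF has not been widely adopted.

2. Message passing. The data processed by each task is separate, and tasks communicate by message passing; data partitioning and message passing are handled by the programmer and the user, which places high demands on the programmer. This model suits message-passing architectures (such as cluster systems) very well; users and programmers mainly need to consider communication synchronization and communication performance.

Parallel Virtual Machine (PVM) and the Message Passing Interface (MPI) are two widely used message-passing parallel programming standards. PVM emphasizes portability and interoperability in heterogeneous environments; MPI puts more emphasis on performance, but has different implementations in heterogeneous environments. Almost all high-performance computing systems support PVM and MPI.

3. Shared memory. Tasks share the data they process in memory and also communicate through that shared data; data sharing can be handled by the programmer or by the compiler. Shared-memory parallel programming is mainly used on symmetric multiprocessor (SMP) systems.

OpenMP (Open Multi-Processing, developed from X3H5) and PThread (POSIX Threads) are both implementations of shared-memory parallel programming.

OpenMP grew out of the X3H5 standard established in 1993 and has become the de facto industry standard for shared-memory parallel programming, with broad support from vendors such as DEC, Intel, IBM, and Sun. It is implemented for Fortran and C/C++, and mainly supports implicit parallel programming, that is, parallelism realized by the compiler.

PThread is mainly used on Unix systems. There are many Unix implementations, such as Linux, FreeBSD, Solaris, and Mac OS X. Developing cross-platform multithreaded applications across the many “Unix-like” systems is far from easy, which is why the POSIX Thread standard was defined. “Programming with POSIX Threads” by David R. Butenhof (one of the initiators of the Boost library and a member of the ISO C++ standards committee) is arguably an essential reference for writing multithreaded applications on Unix. It is also of great reference value for parallel program development on other platforms.

Overall, shared-memory parallel programming is closest to how most multithreaded programmers already think, and it is the lowest-cost option for programmers moving from single-core to multicore systems. Experts still disagree, however. Herb Sutter, for example, is not optimistic about OpenMP, because shared-memory parallel programming has not improved much in essence: it still relies on locking data resources, which causes performance problems. Message-passing parallelism has a performance advantage, but it demands too much of the programmer. All of these difficulties still await further work from the experts who study parallelism and its various standards and libraries.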
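A small sketch of the contrast between the shared-memory and message-passing styles, using Python threads only as a stand-in (real message-passing systems such as MPI run separate processes, often on separate machines). The function names and the summing task are made up for illustration.

```python
import queue
import threading


def sum_shared_memory(chunks):
    """Shared-memory style: workers update one shared total under a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # the shared data must be locked while it is updated
            total += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def sum_message_passing(chunks):
    """Message-passing style: workers own their data and send results back."""
    results = queue.Queue()

    def worker(chunk):
        results.put(sum(chunk))  # the only communication is a message

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results.get() for _ in chunks)


data = list(range(1000))
chunks = [data[i:i + 250] for i in range(0, len(data), 250)]
print(sum_shared_memory(chunks), sum_message_passing(chunks))
```

The first version needs the lock that the shared-memory critique is about; the second avoids shared state, but the programmer has to decide how data is split and what messages are exchanged.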


AltaVista Index Talk (Mike Burrows)

http://www.researchchannel.org/prog/displayevent.aspx?rID=2123

Some notes:

  1. The continuation design is good: it avoids a lot of unused branches.
  2. For critical pieces of code: choose instructions that dual-issue well, use a fixed word structure so that prefetching works, and avoid branch mispredictions.
  3. Branch mispredictions cost a lot of performance, but the question is how to reduce them on x86. I am very familiar with optimizing applications on RISC, but not with the x86 micro-architecture. Right now I use the VTune tools to get the performance results, but I don’t know how to reduce the branch mispredictions. If you have any ideas, please tell me. Thanks a lot.
  4. The interface for index stream readers (ISRs): loc(), next(), seek(X).
  5. Constraint solving is very critical to index serving.
  6. Queries take about 100 cycles/query/MByte (AltaVista), with a 1.5G index size.
  7. Time breakdown: 30% inner loop, 15% constraint solver, 15% higher-level seek code, 7% ranking code, 0.2% merging results. Miss ratios: 2% I-cache, 8% D-cache, 8% level-2 cache, 40% level-3 cache. It seems that current ranking algorithms are more complex than the one in 2000.
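Here is how I imagine the ISR interface from note 4 being used. The semantics are my assumption, not taken from the talk: loc() returns the current location, next() advances by one, and seek(X) advances to the first location at or after X. `WordISR` and `intersect` are hypothetical names, and locations are treated as plain integers.

```python
import bisect

END = float("inf")  # sentinel location meaning "stream exhausted"


class WordISR:
    """Hypothetical ISR over the sorted locations of a single word."""

    def __init__(self, locations):
        self._locs = sorted(locations)
        self._pos = 0

    def loc(self):
        """Current location, or END if the stream is exhausted."""
        return self._locs[self._pos] if self._pos < len(self._locs) else END

    def next(self):
        """Advance by one location and return the new current location."""
        if self._pos < len(self._locs):
            self._pos += 1
        return self.loc()

    def seek(self, x):
        """Advance (never backwards) to the first location >= x."""
        self._pos = bisect.bisect_left(self._locs, x, self._pos)
        return self.loc()


def intersect(a, b):
    """Locations present in both streams, found by leapfrogging with seek."""
    hits = []
    while a.loc() != END and b.loc() != END:
        if a.loc() == b.loc():
            hits.append(a.loc())
            a.next()
            b.next()
        elif a.loc() < b.loc():
            a.seek(b.loc())
        else:
            b.seek(a.loc())
    return hits


print(intersect(WordISR([1, 4, 9, 12, 20]), WordISR([4, 5, 12, 30])))
```

The nice property of seek(X) is that a stream can skip over long runs of locations that cannot match, instead of stepping through them one by one with next().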


Scale in distributed systems

I found an old paper, “Scale in Distributed Systems”, published in 1994. It is still quite useful: it summarizes the problems we will encounter when we want to deploy a system at very large scale. The paper gives a set of principles for scalable systems, along with a list of questions to ask when considering how far a system scales. If you need to design a distributed system, it is worth reading. :)


MashupOS

Today we discussed an SOSP paper, “Protection and Communication Abstractions for Web Browsers in MashupOS”. As we know, mashups are very popular now. Maybe they will be the killer of the traditional operating system; maybe in the future we will be able to do everything in the browser, with the data kept on the Internet at big service companies like Google, Facebook, Yahoo, Live, and MSN. The paper proposes communication abstractions for browsers and ways to protect them. Current browsers do not support communication between different domains, but mashups need this communication, so browsers may support it in the future. At that point, the browser will be like a new operating system. How do we avoid the problems we encounter now (memory leaks, buffer overflows)? The paper aims to solve these problems in MashupOS.


Cannot access WordPress at home

Disappointed! Why does the Chinese government do this?


“I’m Feeling Lucky” button on the Google web site

Click this button and you go directly to the first result of your search. Hehe, I had never used this button until today. Have you ever used it?


Survey on network simulators

Current network simulators are too detailed (packet level). From this talk, it seems that their accuracy is quite fragile, so maybe we should just use real machines and switches to build a simple test bed. Is that enough for us at the beginning stage, when we want to build some geodistributed applications?


Geodistributed service simulation platform

Recently I have been thinking about how to build a geodistributed service. As you know, a geo test bed is hard to build. We do have PlanetLab, but it is a shared cluster: you cannot use it exclusively, which is different from a real deployment.

So I think we need to build a simulator for the geo-environment, so that we can refine our raw design on it before we deploy it to the real world. This simulation platform should be transparent to the application, and it should have a topology layer so that users can define their own topologies.

I know there are dozens of similar simulators, but which one fits our requirements? ns2? If you are an expert on network simulators, please give me some advice. Thanks a lot.

