
A Computer Architecture Study with SimpleScalar Instructor Kit

SimpleScalar is a tool set for simulating programs on different architectures. It gives in-depth information about the execution, which we can refer to while analyzing the architecture or the software. Its website provides instructor benchmarks aimed at computer architecture courses. I ran these benchmarks and collected some interesting results that demonstrate various performance issues in processor architectures.

NOTE: Sorry about the poor quality of the figures; I did not have time to redraw them, so I just took screenshots from the report.

Background

It is often said that memory is the bottleneck of today’s computers. Experts work hard to reduce the slowdowns caused by memory performance, and they have proposed many solutions along the way. Perhaps the most important one is caching.

Suppose you are working on a project and need various resources (books, papers, etc.) while writing your report. It is not possible to keep all of these resources on your desk, so you have to find some way to organize them. Logically, you decide to keep the resources you use most often on your desk. Resources you use less often go on your shelf, and some resources that you rarely use never even come home with you; you leave them at the library. In such a scenario, you would not want to go to the library for every paragraph you write in your report. If you organize your resources poorly, you will lose a serious amount of time fetching data from the resources at the library.

Put very simply, your desk is your L1 (L for level) cache, your shelf is the L2 cache, and the library is the main memory of your computer (most likely, the computer you are reading this essay on has more levels than these). Your hardware works very hard to organize the data it will use into these levels of memory, and the success of this organization determines your computer’s performance most of the time.

Running the Benchmarks

I assume that you are at least familiar with the terms clock cycle and machine instruction.

Here is the data I obtained after running the four benchmarks:

CPI: clock cycles per instruction; on average, how many clock cycles it took to execute one instruction.

L1 miss rate: #L1 misses / #L1 accesses; shows how often we had to look on our shelf (the L2 cache).

L2 miss rate: #L2 misses / #L2 accesses; in turn, shows how often we had to go to the library (main memory).
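As a minimal sketch, the three metrics above can be computed from raw event counters like this. The function names and the counter values are made up for illustration; they are not SimpleScalar output and not the numbers behind the figures in this post.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Average number of clock cycles spent per executed instruction."""
    return cycles / instructions


def miss_rate(misses: int, accesses: int) -> float:
    """Fraction of accesses to one cache level that missed in that level."""
    return misses / accesses if accesses else 0.0


# Hypothetical counters for a single run (assumed values, for illustration only).
cycles, instructions = 2_600_000, 1_000_000
l1_accesses, l1_misses = 400_000, 20_000
# In this simple model, every L1 miss becomes exactly one L2 access.
l2_accesses, l2_misses = l1_misses, 11_000

print("CPI          :", cpi(cycles, instructions))
print("L1 miss rate :", miss_rate(l1_misses, l1_accesses))
print("L2 miss rate :", miss_rate(l2_misses, l2_accesses))
```

Note that the L2 miss rate here is relative to L2 accesses only, matching the definition above, so a high L2 miss rate can still correspond to a small number of trips to main memory.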

Let’s analyze some of the interesting points.

Very high CPI in the anagram benchmark

The first thing that stands out in the graph is that the anagram benchmark has a CPI above 2.5, which is much higher than the others. A possible reason is its high L2 cache miss rate: more than one in two L2 accesses misses and has to be satisfied from main memory, which might have caused additional stalls. (Fig. 1)
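To see why the L2 miss rate matters so much, here is a rough sketch of the standard average memory access time (AMAT) formula for a two-level cache. The latencies (1, 6 and 100 cycles) and the miss rates are assumptions chosen for illustration, not values measured in these benchmarks or taken from the simulator configuration.

```python
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_latency):
    """Average memory access time in cycles for an L1 + L2 + memory hierarchy."""
    # Average cost of an access that already missed in L1 and goes to L2.
    l2_level_time = l2_hit + l2_miss_rate * mem_latency
    return l1_hit + l1_miss_rate * l2_level_time


# Assumed latencies in cycles (hypothetical, not from the benchmarks).
L1_HIT, L2_HIT, MEM = 1, 6, 100

low_l2_miss = amat(L1_HIT, 0.05, L2_HIT, 0.10, MEM)
high_l2_miss = amat(L1_HIT, 0.05, L2_HIT, 0.55, MEM)

print("AMAT with 10% L2 miss rate:", low_l2_miss)
print("AMAT with 55% L2 miss rate:", high_l2_miss)
```

With the same L1 behavior, raising only the L2 miss rate more than doubles the average access time in this toy setup, which is the kind of effect that can inflate CPI.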

Another thing I will discuss is the translation lookaside buffer (TLB). The TLB is a separate cache that stores the results of virtual-to-physical address translations. In this benchmark the TLB miss rate is 0.0053, which is almost 2.5 times that of the next-highest benchmark. Consequently, it required many more page table accesses than the other three benchmarks. (Fig. 3)

If we miss in the TLB, we need to do a page walk to find the translation we need. You can think of this as finding a book in the library: if you have used that particular book recently, you can go straight to it without much trouble; if you have not, you need to walk around between the shelves for a while.
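The hit and miss paths can be sketched with a toy TLB. This is only an illustration under simplifying assumptions: a fully associative TLB with LRU replacement, 4 KiB pages, and a page table modeled as a plain dictionary, so the "page walk" is a single lookup. The class name TinyTLB and all values are hypothetical and do not describe how SimpleScalar models its TLB.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TinyTLB:
    """Toy fully associative TLB with LRU replacement."""

    def __init__(self, entries, page_table):
        self.entries = entries
        self.page_table = page_table  # virtual page number -> physical frame number
        self.cache = OrderedDict()    # least recently used entry comes first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.cache:
            self.hits += 1
            self.cache.move_to_end(vpn)       # mark as most recently used
            pfn = self.cache[vpn]
        else:
            self.misses += 1
            pfn = self.page_table[vpn]        # stands in for the costly page walk
            self.cache[vpn] = pfn
            if len(self.cache) > self.entries:
                self.cache.popitem(last=False)  # evict the least recently used entry
        return pfn * PAGE_SIZE + offset


page_table = {0: 7, 1: 3, 2: 9}  # hypothetical mappings
tlb = TinyTLB(entries=2, page_table=page_table)
for addr in (0x10, 0x20, 4096, 8192, 0):
    tlb.translate(addr)
print("hits:", tlb.hits, "misses:", tlb.misses)
```

Touching more distinct pages than the TLB has entries forces repeated walks, which is the behavior a high TLB miss rate points to.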

The compress benchmark has the best CPI despite a very high cache miss rate

Another observation is that the compress benchmark achieved the best CPI of all, even though its cache miss rates are significantly higher than the others (except for the L2 miss rate of the anagram benchmark). The reason might be that the benchmark accesses memory less often.

To check that, we need to look at the load/instr ratio. As the name suggests, a load is a machine instruction that tells the processor to fetch some data from memory. (An important note here: as an assembly programmer you cannot decide whether the data comes from the cache or from main memory; this decision is made at the hardware level.) There we can see that this benchmark made more than ten times fewer memory accesses per instruction than its closest competitor. (Fig. 4)
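A back-of-the-envelope sketch shows how a program with a high miss rate can still lose fewer cycles to memory: the stall cycles added to CPI are roughly loads per instruction × miss rate × miss penalty. The two program profiles and the 50-cycle penalty below are invented for illustration and are not the measured values of any benchmark in this post.

```python
def memory_stall_cpi(loads_per_instr, miss_rate, miss_penalty):
    """Approximate stall cycles per instruction caused by load misses."""
    return loads_per_instr * miss_rate * miss_penalty


PENALTY = 50  # assumed miss penalty in cycles

# Hypothetical program A: few loads, but they miss often.
few_loads = memory_stall_cpi(loads_per_instr=0.02, miss_rate=0.20, miss_penalty=PENALTY)
# Hypothetical program B: many loads that usually hit.
many_loads = memory_stall_cpi(loads_per_instr=0.25, miss_rate=0.05, miss_penalty=PENALTY)

print("stall CPI, few loads / high miss rate :", few_loads)
print("stall CPI, many loads / low miss rate :", many_loads)
```

In this toy comparison the program with the fourfold higher miss rate still pays less than a third of the stall cycles, simply because it issues far fewer loads.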

The second issue we see here is not about memory. It is about branches. There are book chapters hundreds of pages long on this topic, but I will try to give the basic idea. Branches are instructions that might change the flow of control in your program. Various high-level language constructs compile to them: conditional statements (if, switch), any kind of loop (for, while, foreach, etc.), and function calls and return statements. The tricky ones are conditional branches (all branches except unconditional ones such as function calls and returns, which have a specific name: jumps).

Processors can work on more than one instruction at the same time (as far as I know, the P4, the most advanced commercial single-core processor, could have more than a hundred instructions in flight at once). To use this ability, they have to somehow “guess” whether the flow of control will change at a branch instruction. This is a very complex process, and naturally the processor can guess wrong, in which case it has to cancel some of the instructions it has already executed and recover. This is obviously very costly.
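One classic way to make this guess is a 2-bit saturating counter per branch. The sketch below is a textbook scheme shown only to give a feel for the idea; it is not claimed to be the predictor used in these simulations or in the P4, and the function name and branch histories are made up.

```python
def count_mispredictions(outcomes, state=2):
    """Simulate one 2-bit saturating counter over a list of branch outcomes.

    States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    Returns how many outcomes were predicted wrongly.
    """
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredictions += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredictions


# A loop branch: taken 9 times, then falls through once; the loop runs 3 times.
loop_history = ([True] * 9 + [False]) * 3
# A branch that alternates every time is much harder for this scheme.
alternating_history = [True, False] * 15

print("loop branch mispredictions       :", count_mispredictions(loop_history))
print("alternating branch mispredictions:", count_mispredictions(alternating_history))
```

The regular loop branch is mispredicted only at each loop exit, while the alternating branch is mispredicted half of the time, even though both histories have 30 outcomes.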

Coming back to our benchmark, we can see that the compress benchmark has a much higher IPB (instructions per branch) value than the others. (Fig. 4) This means there are more instructions between two consecutive branches in this benchmark. Further evidence is that it has a lower instruction L1 miss rate than the others. We can interpret this as: “while running the compress benchmark, the processor did a good job guessing the instructions it would execute and filled its caches accordingly”. This regular instruction sequence can be another reason why the compress benchmark has the best CPI.

Conclusion

In this study, I tried to show some important points to consider while analyzing a processor’s performance. The study also shows that the efficiency of an algorithm can be dramatically affected by the way it is compiled or written in assembly language.
