JUMP or Conditional Move?
This is more for my remembrance than anything but I though I would put it up here anyway. Large applications written in assembly language can be very jump heavy, these are the way traditional branching happened. Compares then a Jump based on a condition. Taking care not to jump too far (I will not go into details about this)
CPU manufacturers are always looking at providing better performance without always ramping up the number of Hz a CPU can do. There is also a large number of trade offs when looking into mobile processors where energy is also of concern. This is where pre-fetching and decoding instructions started to show benefits. Thus the introduction of branch prediction which tries to “intelligently” guess where the code will jump to when it crosses branching.
So, more so than with any other language, assembly needs to have a lot of care taken as to optimise to take advantage of the processors and also new features introduced into processors. Prior to the introduction of the Pentium Pros (P6) architecture from Intel, assembly was very jump heavy. Below is an example of that, it gets a called from a C++ program so please ignore how the application starts.
.486 ; 32bit 486 Processor Directive .model flat, c .data ; Data Segment - Read/Write .code ; Code Segment - Read/Execute FindSmallest proc export mov edx, dword ptr [esp + 4] ; edx = *int mov ecx, dword ptr [esp + 8] ; ecx = Count mov eax, 7fffffffh ; eax will be our answer cmp ecx, 0 Are there 0 items? jle Finished ; If Jump to Finish MainLoop: cmp dword ptr [edx], eax ; Compare *edx and eax jl IsLessThan ; Jump if less than to IsLessThan jmp CarryOn ; Skip over the move operation IsLessThan: mov eax, dword ptr [edx] ; Movw *edx into eax CarryOn: add edx, 4 ; Move to the next integer (position in array) dec ecx ; Decrement counter jnz MainLoop ;Loop IF ecx is not zero Finished: ret ; return with the lowest in eax FindSmallest endp ; End of the procedure end ; Closes the segment
OK, this will execute nicely on any processor, even though I have .486 processor directive I could also use .386 and it would work the same since I am using 32bit registers.
So, what I am doing is conditional jumping. When I perform the compares or performing operations, I will jump if it is less than, jump if no zero, jump (this is different in some manner).
So what is wrong with this? OK, CPUs use something that is called branch prediction. It pre-reads the code and determines the jumps (branching) based on an algorithm. This is a fairly complex article, so I am sorry to people who might read my blog and get confused with everything so far.
So, processors have multiple code pipelines, this is essentially what enables this branch prediction, the main pipeline is executing the code the others are reading ahead in the code and determining the possible branches based on different factors. If this prediction is incorrect which is can be, the benefits of pre-reading the code is lost and the application must now perform the actual work from the code. Wait, doesn’t it need to perform the actual work? Yes it does, but the read-ahead is not writing data to memory it isn’t decrementing counters or moving data from one register to another, it is merely working out the jumps that need to happen, if it predicts right then the line of code it will read is right and the IP will be set correctly. If not, then, well it loses the benefit and keeps on trucking.
So where am I going with this. The Conditional Move, since a lot of the time we are moving data in and out of RAM and/or registers, we are jumping and the like. It isn’t the only thing done, but it is done a lot. In circumstances where like above, based on certain conditions we will move information from memory into EAX and then we loop back and continue. But when the Pentium Pro was introduced so to was the conditional moves. The difference is, there is no branching needing to be done therefore the read-ahead will continue and we will not loose any benefits from reading ahead.
.686 ; 32bit Pentium Pro Processor .model flat, c .data ; Data Segment - Read/Write .code ; Code Segment - Read/Execute FindSmallest proc export mov edx, dword ptr [esp + 4] ; edx = *int mov ecx, dword ptr [esp + 8] ; ecx = Count mov eax, 7fffffffh ; eax will be our answer cmp ecx, 0 Are there 0 items? jle Finished ; If Jump to Finish MainLoop: cmp dword ptr [edx], eax ; Compare *edx and eax cmovl eax, dword ptr [edx] ; Movw *edx into eax add edx, 4 ; Move to the next integer (position in array) dec ecx ; Decrement counter jnz MainLoop ;Loop IF ecx is not zero Finished: ret ; return with the lowest in eax FindSmallest endp ; End of the procedure end ; Closes the segment
This code does the same thing, the difference instead of the extra jumps, it will compare and if the memory item is less than the item in EAX it will move, otherwise a NOP is executed. There is no possibility with this being mispredicted as it isn’t jumping anywhere (changing the IP to anything other than being incremented which happens regardless.
I remember when I first started writing assembly, the top code was the only way it could be done. Pentiums came in but not using the P6 architecture. Granted back then multiple code pipelines weren’t common in processors and code was just executed in order. A lot of this also brings more light into the Out of Order Execution as well, which relies greatly on branch prediction.
If the branching is based on constants or repetitive tasks, looping 4 times every time, the pipelines will be fully utilised and the maximum benefits of the deep pipelines is obtained. When writing assembly code we can often optimise our code to use these techniques to help the processors get its maximum benefits.
So if you know you are targeting the Pentium Pro processor or greater then it is best to look at the way your code branches and see if it can make use of the conditional moves. It is best to avoid branching in assembly code, unless it is a fixed branch (unconditional) but there is always a need to perform branching in applications and removing it seems like a LOT of extra work. Compilers in high level languages can often do this if it knows specifics about how many times it needs to or there is a specific condition met. But it can often be based on dynamic data and these are the most difficult to predict, even using the historical information to predict future outcome is only as good as the amount of history than can be stored or the determination of what history.
Intel put a lot of effort into future processors improvements in this branch prediction as anytime a branch is mispredicted there is a performance penalty and all energy used to decode the code after the branch is wasted and with mobile processors and our reliance on mobile devices lasting longer and longer this is a big issue.