Doing some benchmarks

08-20-2011, 01:48 PM

Okay, since people often claim how some code is super fast and other code is terribly slow I decided to set up a testing environment where I could test such vague statements and quantify them with solid numbers.

Since stopwatch natives no longer work with the current patch, I had to resort to FPS tests when doing benchmarks. The problem with this is that it only tells you which of the two pieces of code you are repeating X thousand times per second is faster; it doesn't tell you how much faster. The relationship between code execution cost and FPS is not linear: getting 20fps for one piece of code and 40fps for another doesn't mean that the latter is twice as fast.

I get around this problem by varying the number of times I repeat each code until they both result in the same FPS drop. At that point, I can simply compare the number of times I had to repeat each code to see how much longer one of them took to execute than the other. Thus, my testing environment looked like this:

JASS:

scope test initializer Init

    globals
        private integer i
    endglobals

    private function Test takes nothing returns nothing
        // a few comment lines
        // to prevent inlining
    endfunction

    //! textmacro TenCalls
    call Test()
    call Test()
    call Test()
    call Test()
    call Test()
    call Test()
    call Test()
    call Test()
    call Test()
    call Test()
    //! endtextmacro

    private function Periodic takes nothing returns nothing
        set i=100
        loop

            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            // 100 calls
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            // 200 calls
            set i=i-1
            exitwhen i==0
        endloop
        //call BJDebugMsg("op limit not reached") // used to confirm that the code has not reached the op limit.
    endfunction

    private function Init takes nothing returns nothing
        call TimerStart(CreateTimer(), 0.01, true, function Periodic)
    endfunction

endscope

There is some overhead involved in the test, namely the timer expiration and the loop. Since I copy the textmacro many times, the cost of this overhead should be insignificant. Besides, it is roughly the same for all test cases (I vary the value of i a bit, but not by much), so it is effectively constant and shouldn't affect the results.

With a test environment like that, I would simply write the code I wanted to test, copy it ten times and put it in a textmacro, then run that textmacro a bunch of times in the periodic function. Then, I would keep changing the number of times I run the textmacro and/or repeat the loop and testing the map until I would get a specific FPS (20fps in my case). Once that was reached, I would count the number of times I ran the code and document that. Here are the results:

JASS:
```
    call Test()
```
A simple function call. As is shown in the full code above, I ran this 20000 times per update to get the target FPS. When compared to other things that I tested later, simple function calls turn out to be quite fast, so I decided to make this my basic benchmarking unit, one function call or FC for short. If someone else repeated my tests on different hardware or with a different target FPS, they would get a different number of code executions they can do per update, but if they then calculated the FC values, those should match mine unless function calls have different resource costs on different hardware.
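The conversion from an instance count to FC is a single division by the reference count. A minimal sketch, assuming the 20000-call reference from the test above; the names FC_REFERENCE and CostInFC are made up for illustration and are not part of the test map:

```jass
globals
    // Hypothetical constant: plain function calls per update at the target FPS.
    constant real FC_REFERENCE = 20000.0
endglobals

// Returns the cost in FC of a snippet that could be repeated
// 'count' times per update at the same target FPS.
function CostInFC takes integer count returns real
    return FC_REFERENCE / I2R(count)
endfunction
```

For example, CostInFC(12000) gives about 1.67 and CostInFC(350) gives about 57. Someone repeating the tests on other hardware would only need to replace the reference value with their own count of plain function calls.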
JASS:
```
    call Test2(1)
```
A function call with an integer argument is the next thing I decided to test. This is equivalent to calling a non-static method without arguments. Note that in most cases, you would use a variable or even an array as the argument, which is presumably more costly than a simple static value, but that is not a part of this benchmark. As expected, adding an argument increases the cost of the function call: I could only run 12000 at the same FPS, which makes the cost of a function call with one argument 1.67 FC.
JASS:
```
    call Test3(1,2,3,4,5)
```
The next step is a function with multiple arguments, five in this test case. I could only do 4400 of these calls, which makes their cost 4.55 FC. Subtracting the 1 FC of the bare call leaves about 0.7 FC per argument, which matches the previous test reasonably well.
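The definitions of Test2 and Test3 were not posted, so the following is only a guess at what they looked like: empty functions with the same comment-only body as Test, differing only in their parameter lists. The trailing comments spell out the per-argument arithmetic from the two measurements.

```jass
// Hypothetical definitions; the originals were not posted.
function Test2 takes integer a returns nothing
    // a few comment lines
    // to prevent inlining
endfunction

function Test3 takes integer a, integer b, integer c, integer d, integer e returns nothing
    // a few comment lines
    // to prevent inlining
endfunction

// Per-argument cost, after subtracting 1 FC for the bare call:
//   one argument:   20000/12000 - 1.0        = 0.67 FC
//   five arguments: (20000/4400 - 1.0) / 5   = 0.71 FC
```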
JASS:
```
    call Test.execute()
```
Compared to function calls, .execute is very slow at only 350 instances or a cost of about 57 FC.
JASS:
```
    call Test.evaluate()
```
Evaluate doesn't fare much better at 550 instances or a cost of 36 FC. I just realized that I was testing this in debug mode, where JassHelper doesn't do inlining. If I tested this outside debug mode, the cost should be reduced by 1 FC, which isn't very significant.
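For readers who have not looked at compiled vJass: a rough hand-written equivalent of an .evaluate call, under the assumption that JassHelper turns it into a TriggerEvaluate on a trigger whose condition wraps the target function. The generated script is not shown in this thread, so the names testTrig, TestWrapper and InitTestTrig are purely illustrative.

```jass
globals
    // Hypothetical name; JassHelper generates its own, longer identifiers.
    trigger testTrig = CreateTrigger()
endglobals

function Test takes nothing returns nothing
endfunction

// Wrapper with the boolean return type that a trigger condition expects.
function TestWrapper takes nothing returns boolean
    call Test()
    return true
endfunction

function InitTestTrig takes nothing returns nothing
    call TriggerAddCondition(testTrig, Condition(function TestWrapper))
endfunction

// Under the assumption above, "call Test.evaluate()" then amounts to:
//     call TriggerEvaluate(testTrig)
```

If that assumption holds, the cost of .evaluate is dominated by the trigger evaluation rather than by the wrapped function call itself.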
JASS:
```
    call Test3.execute(1,2,3,4,5)
```
Executing a function with arguments is even slower: 230 instances give a cost of 87 FC. Note, however, that while an .execute without arguments is 57 times slower than the corresponding function call, this is only 19 times slower than calling Test3 directly.
JASS:
```
    set i=i+1
```
Next I decided to test a common operation which involves a variable read, an integer addition and a variable set. I could do about 15500 such operations per timer update, which gives a cost of 1.29 FC. However, I then realized that I had made the global integer i a private variable, which means its name was something like test5_i. Since longer variable names are slower than shorter ones, I repeated the test with the short name and could get 16500 operations on the second try, giving a cost of 1.21 FC. I also tried changing i to a local variable, but there was no significant difference in performance.
JASS:
```
    call SetUnitX(u,0.0)
    call SetUnitY(u,0.0)
```
This is the next thing I tried, another operation we often do in fast periodic triggers. I could run about 2000 instances of this, so the cost is roughly 10 FC. However, this is hardly all that is involved in moving projectiles or knocking units back, so I tested some slightly more complicated examples:
JASS:
```
    set a=a+0.05
    if a>bj_PI*2.0 then
        set a=0.0
    endif
    set x=Cos(a)*128.0
    set y=Sin(a)*128.0
    call SetUnitX(u,x)
    call SetUnitY(u,y)
```
A simple cyclic example of unit movement. This can be run about 800 times, which gives us a cost of 25 FC. 10 FC of that is used by SetUnitX/Y, so we see that even the simple maths involved in moving units is not insignificant. However, this is not the most typical example: we don't always need to do trigonometry, and perhaps that is the most costly part, so we do another test:
JASS:
```
    set x=x+0.05
    set y=y+0.02
    if x>500.0 then
        set x=0.0
    endif
    if y>500.0 then
        set y=0.0
    endif
    call SetUnitX(u,x)
    call SetUnitY(u,y)
```
This does somewhat better, 1200 instances or a cost of 16.7 FC. Compared to 10 FC needed for SetUnitX/Y, the overhead here is much smaller. As I was doing this test, though, I realized it was still not very representative. We usually work with structs, which means we don't use global variables, we use global arrays. Time to tweak our test case some more:

JASS:

    set X[N]=X[N]+0.05
    set Y[N]=Y[N]+0.02
    if X[N]>500.0 then
        set X[N]=0.0
    endif
    if Y[N]>500.0 then
        set Y[N]=0.0
    endif
    call SetUnitX(U[N],X[N])
    call SetUnitY(U[N],Y[N])

Doing this brings us down to 880 instances or 22.7 FC. But wait, I made all my variables private by mistake again. Once I remove the private keyword and thus make the variable names as short as possible, I can get nearly 1000 instances or 20 FC.
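To illustrate why struct code ends up as array reads with long names: a vJass struct is stored as parallel global arrays indexed by the instance id. The sketch below shows the idea by hand; the struct Mover and the generated-looking names are illustrative only, and the exact prefix JassHelper uses may differ.

```jass
// A vJass struct such as
//
//     struct Mover
//         unit u
//         real x
//     endstruct
//
// behaves like a set of global arrays, so "this.x" becomes an array read.
globals
    unit array s__Mover_u
    real array s__Mover_x
endglobals

// Hand-written equivalent of a method that moves one instance along x.
function MoveInstance takes integer id returns nothing
    set s__Mover_x[id] = s__Mover_x[id] + 0.05
    if s__Mover_x[id] > 500.0 then
        set s__Mover_x[id] = 0.0
    endif
    call SetUnitX(s__Mover_u[id], s__Mover_x[id])
endfunction
```

Every member access in such code repeats the long array name, which is where the difference between the private and non-private variants of the test would come from.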

JASS:
```
        if 1==1 then
        endif
```
This is one more thing I tried testing, but I hit the op limit around 50000 instances before hitting 20fps, so I threw in a thousand SetUnitX/Y calls to use up half of my available processing power. With the rest, I could get 34000 instances of this if statement, which corresponds to 68000 instances for the full budget and a cost of 0.3 FC. That's pretty fast, but all I was testing was an if statement and an integer comparison. Let's try something bigger:
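The filler trick generalizes: subtract the known cost of the filler from the reference budget and divide what is left by the instance count. A small sketch, assuming the 20000 FC budget and the roughly 10 FC cost of a SetUnitX/Y pair measured earlier; CostWithFiller is a hypothetical helper name.

```jass
// Hypothetical helper: cost in FC of a snippet measured alongside filler
// code, for snippets that would hit the op limit on their own.
//   count       - instances of the snippet per update
//   fillerCount - instances of the filler code per update
//   fillerCost  - known cost of one filler instance, in FC
function CostWithFiller takes integer count, integer fillerCount, real fillerCost returns real
    // 20000.0 is the reference budget: plain function calls per update.
    return (20000.0 - I2R(fillerCount)*fillerCost) / I2R(count)
endfunction

// CostWithFiller(34000, 1000, 10.0) = (20000 - 10000)/34000, about 0.29 FC
```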
JASS:
```
        if X[N]>=0.0 then
        endif
```
Here we go, global read, array read, real comparison and an if statement, this brings our instance count down to 20000, so the cost is roughly 1 FC.

JASS:

        call TriggerEvaluate(T)

Adding multiple conditions to one trigger is supposed to be significantly faster than evaluating multiple triggers with one condition. Let's test this:

Results:

number of conditions	1	2	3	5	10
number of instances	650	370	260	165	85
cost (FC)	30.8	54.1	76.9	121.2	235.3
cost per cond.(FC)	30.8	27.1	25.6	24.2	23.5

There is a difference, but it is not very significant. The true cost of a trigger condition is around 22.7 FC with an overhead of 8.1 FC for evaluating the trigger.
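The setup for this test was not posted. A minimal sketch of how a trigger with several conditions could be built, with the two-point fit behind the overhead and per-condition figures written out in the trailing comments. Cond and SetupConditions are hypothetical names, and the condition returns true so that it cannot cut the evaluation short.

```jass
globals
    trigger T = CreateTrigger()
endglobals

// Empty condition; only the cost of evaluating it is of interest.
function Cond takes nothing returns boolean
    return true
endfunction

// Adds n copies of the same condition to T.
function SetupConditions takes integer n returns nothing
    local boolexpr c = Condition(function Cond)
    loop
        exitwhen n <= 0
        call TriggerAddCondition(T, c)
        set n = n - 1
    endloop
    set c = null
endfunction

// Two-point fit through the 1- and 10-condition measurements:
//   per condition: (235.3 - 30.8) / (10 - 1) = 22.7 FC
//   overhead:      30.8 - 22.7               =  8.1 FC
```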

The main thing of interest for me is to compare this to the cost of an .evaluate, which is 36 FC. Why is there an overhead of 5 FC when .evaluate only calls a single wrapper, which should cost only 1 FC? I suspect the rest of the overhead comes from JassHelper using an array with a long name, as opposed to my variable T. Perhaps the length of the function name also affects results? My first test was done with a function that had a name 10 characters long. Let's try other lengths:

JASS:
```
        call ThisIsAFiveTimesLongerFunctionNameThan__TestX_Test()
```
At 11500 instances, the cost of calling this function is 1.74 FC, quite a difference. However, when I tested the same thing with a function name 1 character long, there was hardly any difference compared to 10 characters. I could run about 21200 instances, but when I repeated the first test more accurately it turned out I could get a bit over 20000 instances there as well, so the difference is negligible. The length of a function name appears to make a difference only when it's very long.

The above examples were all very theoretical, even when I tried to simulate practical uses I could only speculate about what users would do in their periodic triggers, so I decided to do a much more practical stress test, using an actual projectile library. I tested xemissile, a rather well optimized script designed specifically for simulation of WC3 missiles and nothing more. This is the simplest script I could think of that actually does something useful. A simple knockback system could perhaps be a bit less costly, but that's about it. I was only testing running costs, not allocation, so I simply created a bunch of very slow missiles when the map started and then watched the FPS as they moved towards their target:

JASS:

    private function Init takes nothing returns nothing
        set i=105
        set u=CreateUnit(Player(0), 'hfoo', 500.0,0.0,0.0)

        loop
            call xemissile.create(-1000,GetRandomReal(-500,500),0, 1000,0,0).launch(25,0.1)
            //call xehomingmissile.create(-1000,GetRandomReal(-500,500),0, u,0).launch(25,0.1)
            set i=i-1
            exitwhen i==0
        endloop
    endfunction

I could run either 105 xemissiles or 60 xehomingmissiles. Keep in mind that I was testing this with XE_ANIMATION_PERIOD set to 0.01 so that I could compare it to the rest of my tests; in practice you can have a lot more missiles than that. An xemissile therefore costs 190 FC if it has a static target and 333 FC if it has a moving target.

That's all I have so far. You're welcome to make your own tests similar to these, but before you use the FC unit to measure processing costs please first repeat some of my tests and check if you get the same ratios between results in order to confirm that FC is a reliable way to measure this.

09-19-2011, 02:13 PM

BBQ

I would like to request a "newer" benchmark of array reading against hash table reading. If the hash table turns out to be very slow, then I would also like to see a comparison between 2D arrays and hash tables.

Furthermore, you could test if the read/write speed of a hash table depends on the amount of objects stored within it (it shouldn't, but still...).

I would do those benchmarks myself, but I have no WC3 at the moment.

09-20-2011, 02:31 PM

Anitarf

I did the benchmarks you requested. Since you can't have a variable read on its own, I decided to test it as part of an if statement, since I have already done some such tests before. Note that I was comparing practical costs, so each array read also involves one variable read and each hashtable read involves three variable reads (one for the hashtable, two for the keys). My test cases thus pretty much represent what Table inlines and optimizes to. Here are my results:

Table:

code:	count:	cost:	read cost:
if 1==1 then endif	68000	0.3 FC
if X[N]>=0.0 then endif	19200	1.07 FC
if I[N]>=0 then endif	18000	1.11 FC	0.81 FC
if I[N+5*N]>=0 then endif	9500	2.11 FC	1.81 FC
if LoadInteger(H,N,N)>=0 then endif	5700	3.51 FC	3.21 FC
if LoadReal(H,N,N)>=0.0 then endif	5600	3.57 FC	3.27 FC
if LoadInteger(H,N,N)>=0.0 then endif	5500	3.64 FC	3.34 FC

Comments:

I was surprised to find an integer array read and a comparison slower than a real array read and a comparison, since if I had to guess I'd say they're either equally fast or integers are faster. Not sure what causes this difference between integers and reals, the array read or the comparison. Both arrays had the same value (5 and 5.0 respectively) set at that index, so it's not a problem with reading uninitialized values or anything like that. I switched between the two tests multiple times and got consistent results, although the difference is a mere 0.03 FC which is insignificant, so if anyone goes around saying how reals are faster than integers based on this I'll have to slap them.
On the other hand, integers are faster when doing a hashtable read plus comparison, so again I don't know what to make of this. The difference is still insignificant (0.06 FC) so I'm tempted to say this is just a random fluctuation because these tests are not accurate but really, if I tried the same number of instances for both test cases, I would get significantly different FPS. I have to point out again that although there really does seem to be a difference between the two, it is not significant at all.
The last test I did was a real comparison where one of the compared values is an integer; in this case, the engine has to typecast the integer to a real before doing the comparison. Apparently, this is done very very quickly because the difference to a regular real hashtable read and comparison is as insignificant as the differences I was discussing before.

Conclusions:

After subtracting the cost of the if statement itself (0.3 FC), an array read costs about 0.8 FC, a simulated 2D array read costs 1.8 FC and a hashtable read costs 3.2 FC, which is 4 times the cost of an array read and about 1.8 times the cost of a simulated 2D array read.
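For anyone who wants to repeat these read tests, a sketch of the setup they imply. The variable names H, I and N follow the table; the value of N is an assumption, and the stored value 5 is the one mentioned in the comments above.

```jass
globals
    hashtable H = InitHashtable()
    integer array I
    integer N = 5   // hypothetical index/key value
endglobals

function InitReadTest takes nothing returns nothing
    // Store the same value everywhere so that every read hits
    // initialized data.
    set I[N] = 5
    set I[N + 5*N] = 5
    call SaveInteger(H, N, N, 5)
endfunction

function ReadTest takes nothing returns nothing
    // 1D array read
    if I[N] >= 0 then
    endif
    // simulated 2D array read: the index is computed from two values
    if I[N + 5*N] >= 0 then
    endif
    // hashtable read with a parent key and a child key
    if LoadInteger(H, N, N) >= 0 then
    endif
endfunction
```

In the benchmark itself each of the three if statements would go into its own textmacro and be measured separately.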

10-12-2011, 10:34 PM

Anitarf

I did some more tests. I decided to test different methods of running code for units in an area. As it turns out, now that group enumerations with a null filter no longer leak, FirstOfGroup based loops perform the best.

Initialization:

    private function Init takes nothing returns nothing
        local integer n=20
        loop
            call CreateUnit(Player(0), 'hfoo', n*50.0-500.0,0.0,0.0)
            set n=n-1
            exitwhen n==0
        endloop
        set b=Filter(function F1)
        call TimerStart(CreateTimer(), 0.01, true, function Periodic)
    endfunction

EnumFilter loop:

    function F1 takes nothing returns boolean
        set u=GetFilterUnit()
        return false
    endfunction

    //! textmacro FilterLoop
        call GroupEnumUnitsInRange(g, 0.0,0.0,1000.0, b)
    //! endtextmacro

ForGroup loop:

    function F2 takes nothing returns nothing
        set u=GetEnumUnit()
    endfunction

    //! textmacro ForGroupLoop
        call GroupEnumUnitsInRange(g, 0.0,0.0,1000.0, null)
        call ForGroup(g, function F2)
    //! endtextmacro

FirstOfGroup loop:

    //! textmacro FirstOfGroupLoop
        call GroupEnumUnitsInRange(g, 0.0,0.0,1000.0, null)
        loop
            set u=FirstOfGroup(g)
            exitwhen u==null
            
            call GroupRemoveUnit(g, u)
        endloop
    //! endtextmacro

I chose a group of 20 units as my test case. I adjusted the number of times each textmacro was repeated in the Periodic function so that I got the same frames per second as in my reference case of 20000 empty function calls.

Results:

	count	FC cost	per unit
Enum filter	31	645 FC	32 FC
ForGroup	28	714 FC	36 FC
FirstOfGroup	75	266 FC	13 FC

The results indicate that group enum filters and ForGroup callbacks are about as costly as trigger evaluations, which makes them rather slow. A filter-less enum coupled with a FirstOfGroup loop is roughly 2.4 to 2.7 times faster.

I did one more batch of tests. I compared how many FirstOfGroup loops I could do for different numbers of units in the area.

Results:

number of units	1	2	3	5	10	20
number of instances	630	450	355	245	140	75
cost (FC)	32	44	56	82	142	267
cost per unit (FC)	32	22	19	16	14	13

Based on these results, the overhead of enumerating units in an area and then doing a FirstOfGroup loop is 19.7 FC and each unit that ends up in the group costs an additional 12.3 FC.
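The two fitted numbers can be turned into a quick estimate and checked against the table above. A sketch with a hypothetical function name:

```jass
// Hypothetical estimate of the cost (in FC) of one filterless
// GroupEnumUnitsInRange plus FirstOfGroup loop over n units, using the
// fitted overhead and per-unit figures.
function EstimateFoGCost takes integer n returns real
    return 19.7 + 12.3*I2R(n)
endfunction

// EstimateFoGCost(1)  =  32.0  (measured: 32)
// EstimateFoGCost(10) = 142.7  (measured: 142)
// EstimateFoGCost(20) = 265.7  (measured: 267)
```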

I also did a test of only doing a ForGroup on a group of 20 units without enumeration. I could run the loop 32 times, which is only marginally more than the 28 I was able to get with the added GroupEnum. The cost of a ForGroup is thus 625 FC or 31 FC per unit. By subtraction, the cost of the GroupEnum is just about 5 FC per unit.

I compared that to a loop through a linked list of units. I could initially do 80 loops through a list of 20 units, but once I tweaked the loop methods on my struct so they would inline properly, I could do 140 loops, which is nearly 5 times as much as I could do ForGroup loops. Of course, a linked list is optimized for looping, but isn't as good for some of the other operations you can do on groups.
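The linked list code was not posted, so the following only illustrates the general shape of such a loop, assuming a singly linked struct where 0 terminates the list. UnitNode and LoopList are hypothetical names.

```jass
// Hypothetical struct; the original list code was not posted.
struct UnitNode
    unit u
    UnitNode next      // 0 marks the end of the list
    static UnitNode first = 0
endstruct

function LoopList takes nothing returns nothing
    local UnitNode n = UnitNode.first
    local unit u
    loop
        exitwhen n == 0
        set u = n.u
        // per-unit code goes here
        set n = n.next
    endloop
    set u = null
endfunction
```

Reading the members directly, as above, avoids any method call per step, which is in the spirit of the inlining tweak described in the post.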

During the course of testing, I occasionally repeated my reference test of 20000 function calls on a function with no arguments and a name 10 characters long. As I was testing different numbers of units, I noticed that the target FPS had changed, so the tests I did on the FirstOfGroup loops with different numbers of units were not as accurate. How big was the error? I ran some tests on an empty map and a map with 50 footmen. To get the same FPS, I needed to reduce the number of function calls by 3% when I was testing with 50 stationary units on the map. This isn't such a big deal; in my tests I had 1 to 20 units, so my results are still fairly accurate. For fun, I also tested what happened when I ordered those 50 footmen to patrol to a nearby point. The reduction in the number of function calls needed to maintain the same FPS was 5%, so those 50 patrolling footmen cost me 1000 FC.

10-13-2011, 09:10 PM

Anitarf

I recently stumbled upon the following statement:

Quote:

Also, in the exitwhen statement, 0==this is faster than this==0.

I was skeptical about this, I didn't really see a reason for why there should be a difference between the two, so I decided to test it.

I tested this claim with the following two test cases:

Case 1:

        loop
            exitwhen N==0//0==N
        endloop

Case 2:

        set N=10
        loop
            set N=N-1
            exitwhen N==0//0==N
        endloop

Results:

	instances	cost per instance
Case 1, N==0	28000	0.7 FC
Case 1, 0==N	28000	0.7 FC

Case 2, N==0	1050	19 FC
Case 2, 0==N	950	21 FC

The results of the second test case actually suggest that 0==N is slower, but the difference is so small that it's probably due to how the instructions get sorted on my particular CPU, rather than a universal speed difference. I tried changing the order of the exitwhen statement and N=N-1 and got the same results. Considering how small the difference is and the fact that there was no difference on the first test case, the only thing I can reasonably conclude from this is that N==0 and 0==N are equally fast.

10-13-2011, 11:32 PM

#6

Fledermaus

How hard is it to actually update the stopwatch natives? Seems like this patch is gonna be the current one for a while.

10-16-2011, 04:11 AM

PurgeandFire111

Quote:

Originally Posted by Anitarf

I recently stumbled upon the following statement:
I was skeptical about this, I didn't really see a reason for why there should be a difference between the two, so I decided to test it.

Thank you lol. I never understood the logic of that statement. In a logical sense, they would be equal in terms of speed, which apparently is so.

Quote:

How hard is it to actually update the stopwatch natives? Seems like this patch is gonna be the current one for a while.

I think the source code/dl link of the stopwatch natives is down so it would probably be difficult to do unless someone gets in contact with SFilip.

10-16-2011, 11:07 AM

Bannar

Anitarf, could you once again benchmark GroupEnum + filter (performing actions) against GroupEnum + FirstOfGroup enumeration, but under different circumstances?
What I mean is moving those closer to a real environment, where the user uses multiple comparisons (for unit selection) and a higher unit count. Maybe even with a mix of unit types.

I'm just unsure of the current results since those are 'perfect circumstances' which, to be honest, never happen. JASS coders, when writing spells/systems, usually implement a ton of selective stuff to consider all the options and make their work bug-free.

Thanks in advance.

10-16-2011, 05:22 PM

Anitarf

Quote:

Originally Posted by Inferno

The comparisons shouldn't matter since they need to be done for each unit regardless of whether you're doing those comparisons in a filter or in a FirstOfGroup loop. So no matter how much code you add to the loop, the cost of the loop increases by the same amount for both looping methods, so the absolute difference in speed between the two approaches remains the same.
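To make the argument concrete, here is the same selection test placed in both looping methods. This is only a sketch: the life check is a stand-in for whatever selection logic a spell would use, and the function names are hypothetical.

```jass
globals
    group g = CreateGroup()
    unit u
endglobals

// Variant A: the selection test runs inside the enum filter.
function AliveFilter takes nothing returns boolean
    set u = GetFilterUnit()
    if GetWidgetLife(u) > 0.405 then
        // per-unit actions go here
    endif
    return false
endfunction

function EnumWithFilter takes nothing returns nothing
    call GroupEnumUnitsInRange(g, 0.0, 0.0, 1000.0, Filter(function AliveFilter))
endfunction

// Variant B: the same test runs inside a FirstOfGroup loop.
function EnumWithFirstOfGroup takes nothing returns nothing
    call GroupEnumUnitsInRange(g, 0.0, 0.0, 1000.0, null)
    loop
        set u = FirstOfGroup(g)
        exitwhen u == null
        call GroupRemoveUnit(g, u)
        if GetWidgetLife(u) > 0.405 then
            // per-unit actions go here
        endif
    endloop
endfunction
```

The if block is identical in both variants and runs once per unit either way, so it adds the same cost to each. The absolute gap between the two stays the same while the ratio between them shrinks as the per-unit code grows.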

10-16-2011, 07:28 PM

#10

Bannar

Anitarf, I somewhat agree that this situation is similar to the "N==0 and 0==N" issue, although I started to be unsure again after reading random responses. And if you scroll down a bit to Bribe's post, you can see that he mentions the speed advantage of FirstOfGroup dropping to two times instead of three compared to GroupEnum + filter.

What I thought is that the more filter code we add (selection stuff), the smaller the difference between GroupEnum/FoG and GroupEnum/filter becomes, although it still ends up being two times faster.

You know, those benchmarks have greatly changed the view on some major things, especially group enumeration methods.

10-16-2011, 09:28 PM

#11

Anitarf

Quote:

Originally Posted by Inferno

Yes, he did mention that, but he didn't describe his methodology, he didn't post any test code or raw test results nor any reference tests, so I can't really know what he was measuring in the first place. Did he test raw loops like I did or did he add some unit filters to the loop like others have suggested? I can't know that. Maybe he did do a much better test than I, or maybe he just pulled that number out of his ass.

As for other comments in that thread, yes, I did only do these tests on one computer, it's the only one I have. I'm not sure what people who object to this expect me to do, go out and buy a new computer so I can do more WC3 benchmarks? I don't think so. When I first posted this thread I did invite others to replicate my tests on different hardware so if you feel my results are suspect then by all means, repeat my tests. If you don't then you don't get to complain.

The comment about the op limit may be a valid concern, but there are other ways of avoiding it besides moving some code to enum filters. For example, in a missile system, you could run the periodic update function of every missile that does collision checks with an .evaluate instead of a regular function call. The cost of doing so will easily be offset by using the faster FirstOfGroup loop as long as there are enough units in range (which there should be since the whole justification for why the op limit might be a problem was that missile collision checks could be enumerating large amounts of units).

10-26-2011, 11:00 PM

#12

Anitarf

I ran some more tests to verify the feasibility of dummy unit recycling. First, I wrote a simple dummy unit recycling library that maintains multiple queues of dummies at different facing angles so that the dummies are suitable for use as projectiles:

JASS:

library xedummy requires xebasic

    globals
        private constant integer ANGLE_RESOLUTION = 12
        private constant integer DUMMY_PRELOAD_COUNT = 5 // Not yet implemented.
    endglobals

// END OF CALIBRATION SECTION
// ================================================================

    private struct recycleQueue extends array
        recycleQueue next
        recycleQueue prev

        real angle

        integer size
        xedummy first
        xedummy last
        static method onInit takes nothing returns nothing
            local integer i=0
            loop
                exitwhen i==ANGLE_RESOLUTION
                set i=i+1
                set recycleQueue(i).prev=recycleQueue(i-1)
                set recycleQueue(i).next=recycleQueue(i+1)
                set recycleQueue(i).angle=(i-0.5)*(360.0/ANGLE_RESOLUTION)
            endloop
            set recycleQueue(1).prev=recycleQueue(i)
            set recycleQueue(i).next=recycleQueue(1)
        endmethod

        static method get takes real angle returns recycleQueue
            return recycleQueue(R2I(angle/360.0*ANGLE_RESOLUTION)+1)
        endmethod
    endstruct

// ================================================================

    struct xedummy
        private static group g=CreateGroup()
        private unit u

        // ----------------------------------------------------------------

        private xedummy next
        
        private method queueInsert takes recycleQueue q returns nothing
            call SetUnitFacing(.u, q.angle)

            if q.size==0 then
                set q.first=this
            else
                set q.last.next=this
            endif
            set q.last=this
            set .next=0

            // Check adjacent queues to see if they have fewer dummies than this queue.
            // If they do, move the first dummy from this queue to an adjacent queue.
            // This operation is recursive so it will find the local minimum queue and
            // increase the size of that rather than a neighbouring larger queue.
            if q.size>q.next.size then
                set this=q.first
                set q.first=.next
                call .queueInsert(q.next)
            elseif q.size>q.prev.size then
                set this=q.first
                set q.first=.next
                call .queueInsert(q.prev)
            else
                set q.size=q.size+1
            endif
        endmethod
        
        private static method queueRemove takes recycleQueue q returns xedummy
            // Check adjacent queues to see if they have more dummies than this queue.
            // If they do, move the first dummy from that queue to this one.
            // This operation is recursive so it will find the local maximum queue and
            // decrease the size of that rather than a neighbouring smaller queue.
            local xedummy this
            if q.size<q.next.size then
                set this=q.last
                set q.last=.queueRemove(q.next)
                set .next=q.last
                call SetUnitFacing(q.last.u, q.angle)
            elseif q.size<q.prev.size then
                set this=q.last
                set q.last=.queueRemove(q.prev)
                set .next=q.last
                call SetUnitFacing(q.last.u, q.angle)
            else
                set q.size=q.size-1
                if q.size==0 then
                    set q.last=0
                endif
            endif

            set this=q.first
            set q.first=.next
            set .next=0
            return this
        endmethod
    
        // ----------------------------------------------------------------

        private static method create takes unit u returns xedummy
            local xedummy this
            if GetUnitTypeId(u)!=XE_DUMMY_UNITID then
                debug call BJDebugMsg("xedummy.release error: Method called on a unit of an incorrect type.")
            elseif IsUnitInGroup(u, .g) then
                debug call BJDebugMsg("xedummy.release error: Method called on an already released unit.")
            else
                set this=.allocate()
                set .u=u
                call GroupAddUnit(.g, u)
                call .queueInsert(recycleQueue.get(GetUnitFacing(u)))
                call ShowUnit(u, false)
                return this
            endif
            return 0
        endmethod

        private method destroy takes nothing returns nothing
            call GroupRemoveUnit(.g, .u)
            call ShowUnit(.u, true)
            set .u=null
            call .deallocate()
        endmethod

        // ----------------------------------------------------------------

        private static unit dummy
        static method new takes player p, real x, real y, real face returns unit
            local recycleQueue q
            local xedummy this
            loop
                exitwhen face>0.0
                set face=face+360.0
            endloop
            loop
                exitwhen face<360.0
                set face=face-360.0
            endloop
            set q=recycleQueue.get(face)
            if q.size==0 then
                set .dummy = CreateUnit(p, XE_DUMMY_UNITID, x,y,face)
                call UnitAddAbility(this.dummy,XE_HEIGHT_ENABLER)
                call UnitAddAbility(this.dummy,'Aloc')
                call UnitRemoveAbility(this.dummy,XE_HEIGHT_ENABLER)
            else
                set this=.queueRemove(q)
                set .dummy=.u
                call .destroy()
                call SetUnitX(.dummy, x)
                call SetUnitY(.dummy, y)
                call SetUnitFacing(.dummy, face)
                call SetUnitOwner(.dummy, p, true)
            endif
            return .dummy
        endmethod

        static method release takes unit u returns nothing
            call .create(u)
        endmethod
    endstruct

endlibrary

In the first test, I simply compared how many units I could create/recycle on a periodic 0.01 second timer to get the same fps drop as calling a function 20000 times, same as in my earlier tests. I didn't test only unit creation, I had to also include unit removal, else I wouldn't get stable results as the number of units in the map would increase rapidly. In practice, we also have to eventually remove all the units we create so that's all right.

Table:

code:	count:	cost:
call RemoveUnit( CreateUnit(Player(15), XE_DUMMY_UNITID, 0.0,0.0,0.0) )	13	1538 FC
call xedummy.release( xedummy.new(Player(15), 0.0,0.0,0.0) )	57	351 FC

We can see from these results that creating (and removing) a unit is a rather costly operation and that recycling it is more than 4 times faster, possibly more once the map is optimized.
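A short usage sketch of the two public methods of the library above. DummyExample is a hypothetical caller; in a real spell the release would happen when the projectile or effect ends, not right away.

```jass
// Hypothetical usage: borrow a dummy, use it, then hand it back.
function DummyExample takes player p, real x, real y returns nothing
    // facing is given in degrees, as for CreateUnit
    local unit d = xedummy.new(p, x, y, 90.0)
    // ... attach an effect, cast a spell, move it as a projectile ...
    call xedummy.release(d)
    set d = null
endfunction
```

Callers only ever see plain unit handles; the xedummy struct instances exist only while a unit sits in a recycle queue.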

For my second test, I edited xefx to use the library above instead of creating and removing dummy units, then compared how many xemissiles I could run with the edited xefx compared to the original. I used the following test code:

JASS:

library test initializer onInit requires xemissile
    private function Periodic takes nothing returns nothing
        // create a missile with a random target
        local real x=GetRandomReal(-1024,1024)
        local real y=GetRandomReal(-1024,1024)
        local real tx=GetRandomReal(-1024,1024)
        local real ty=GetRandomReal(-1024,1024)
        local xemissile m=xemissile.create(x,y,0,tx,ty,0)
        // calculate the flight distance
        set tx=tx-x
        set ty=ty-y
        set x=SquareRoot(tx*tx+ty*ty)
        // launch with such speed that the distance will be covered in one second
        call m.launch(x, 0.15)
        //set m.fxpath="Abilities\\Weapons\\LordofFlameMissile\\LordofFlameMissile.mdl"
    endfunction

    private function onInit takes nothing returns nothing
        call TimerStart(CreateTimer(), 0.01, true, function Periodic)
    endfunction
endlibrary

At the start, I created 100 xemissiles per second; each missile lasted 1 second, which seemed like a reasonably common value based on my experience from ET. The xe update period was set to 0.025, so each missile went through 40 update cycles in addition to the costs of being created and destroyed. To get the same FPS with the unedited xefx, I had to increase my test timer period from 0.01 to 0.0125, which means the number of missiles created per second dropped from 100 to 80. Using dummy unit recycling thus allowed me to get 25% more missiles for the same cost. Keep in mind, though, that updating xemissiles is relatively cheap, so recycling the dummy units would make less of a difference in a more complicated missile system, since those 40 updates would represent a larger proportion of the total cost.
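
The arithmetic behind the 25% figure can be written out as a small helper. The function and constant names here are illustrative assumptions; only the two timer periods come from the test itself.

JASS:

globals
    constant real PERIOD_RECYCLED = 0.01   // test timer period with the edited xefx
    constant real PERIOD_ORIGINAL = 0.0125 // test timer period with the unedited xefx
endglobals

// one missile is created per timer tick, so the creation rate is 1/period
function MissilesPerSecond takes real period returns real
    return 1.0 / period
endfunction

// MissilesPerSecond(PERIOD_RECYCLED) is about 100 and MissilesPerSecond(PERIOD_ORIGINAL) about 80,
// so the ratio is 100/80 = 1.25: 25% more missiles at the same FPS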

10-30-2011, 12:35 PM

#13

cohadar

Table:


if LoadReal(H,N,N)>=0.0 then endif	5600	3.57 FC	3.27 FC
if LoadInteger(H,N,N)>=0.0 then endif	5500	3.64 FC	3.34 FC

The second statement is slower because the compiler inserts an implicit conversion:
if LoadInteger(H,N,N)>=0.0 then ==> if I2R(LoadInteger(H,N,N))>=0.0 then

You should have used integer zero instead of real zero:
if LoadInteger(H,N,N)>=0 then

Quote:

Originally Posted by Anitarf

so if anyone goes around saying how reals are faster than integers based on this I'll have to slap them.

I think reals could really be faster than integers, I think we should all recode all our stuff to use reals just to be sure.

10-31-2011, 05:29 AM

#14

Nuclear Arbitor

excellent suggestion, you get right on it

11-01-2011, 11:04 PM

#15

Anitarf

Following up on my previous post, I decided to compare the dummy recycling library I wrote above with MissileRecycler, which was recently submitted by Bribe. Since Bribe uses timers and I don't, I was expecting his library to be slightly slower, but not significantly.

Before I could benchmark the libraries, though, I had some tweaking to do. Since MissileRecycler's units do not immediately become available for recycling, I had to preload enough of them for my stress test. To avoid creating too many units, I reduced the number of directions to 3. This is a useless number in practice, but the speed of recycling shouldn't really change with a higher number, so my results are valid. I preloaded 70 units per direction, so I had 210 units on my map in total.

During testing, I was surprised that my library was performing significantly worse than Bribe's. It turned out that the culprit was the ShowUnit native: I was hiding my units on the recycle queue and Bribe wasn't. Since there's really no need to hide the dummy units, I removed the ShowUnit calls from my library and repeated the test.

Table:

Code:	Count:	Cost:
call RemoveUnit( CreateUnit(Player(15), XE_DUMMY_UNITID, 0.0,0.0,0.0) )	12.5	1600 FC
call RecycleMissile( GetRecycledMissile( 0.0,0.0,0.0,0.0 ) )	60	333 FC
call xedummy.release( xedummy.new(Player(15), 0.0,0.0,0.0) ) with ShowUnit	42	476 FC
call xedummy.release( xedummy.new(Player(15), 0.0,0.0,0.0) )	68	294 FC

The results are as expected, so I wouldn't really have bothered posting this if not for the interesting observation that hiding and showing a unit, even one with the 'Aloc' ability, is quite a costly operation: about 180 FC.
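
That figure is simply the difference between the two xedummy rows: 476 FC - 294 FC = 182 FC. One way to cross-check it would be to benchmark the hide/show pair on its own with the same method. A minimal sketch, assuming a global dummy unit created once at map initialization (the variable and function names are hypothetical, and the pairing of one hide with one show per recycle is inferred from the description, not taken from the library code):

JASS:

globals
    unit benchDummy = null // hypothetical: set once at init to a unit of type XE_DUMMY_UNITID
endglobals

function HideShowPair takes nothing returns nothing
    // hide and then show the same unit, one pair per simulated recycle
    call ShowUnit(benchDummy, false)
    call ShowUnit(benchDummy, true)
endfunction

If the pair on its own also comes out near 180 FC, that would support attributing the whole gap to ShowUnit.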