
Doing some benchmarks

08-20-2011, 01:48 PM#1
Anitarf
Okay, since people often claim how some code is super fast and other code is terribly slow I decided to set up a testing environment where I could test such vague statements and quantify them with solid numbers.

Since stopwatch natives no longer work with the current patch, I had to resort to FPS tests when doing benchmarks. The problem with this is that it only tells you which of the two codes you are repeating X thousand times per second is faster; it doesn't tell you how much faster. The relationship between code execution cost and FPS is not linear: getting 20fps for one code and 40fps for another doesn't mean that the latter is twice as fast.

I get around this problem by varying the number of times I repeat each code until they both result in the same FPS drop. At that point, I can simply compare the number of times I had to repeat each code to see how much longer one of them took to execute than the other. Thus, my testing environment looked like this:
Collapse JASS:
scope test initializer Init

    globals
        private integer i
    endglobals

    private function Test takes nothing returns nothing
        // a few comment lines
        // to prevent inlining
    endfunction

    //! textmacro TenCalls
    call Test()
    call Test()
    call Test()
    call Test()
    call Test()
    call Test()
    call Test()
    call Test()
    call Test()
    call Test()
    //! endtextmacro

    private function Periodic takes nothing returns nothing
        set i=100
        loop

            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            // 100 calls
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            //! runtextmacro TenCalls()
            // 200 calls
            set i=i-1
            exitwhen i==0
        endloop
        //call BJDebugMsg("op limit not reached") // used to confirm that the code has not reached the op limit.
    endfunction

    private function Init takes nothing returns nothing
        call TimerStart(CreateTimer(), 0.01, true, function Periodic)
    endfunction

endscope
There is some overhead involved when doing the test, namely the timer expiration and the loop, but since I copy the textmacro many times, the cost of that overhead should be insignificant. Besides, it is roughly the same for all test cases (I vary the value of i a bit, but not by much), so it is effectively constant and shouldn't affect the results.

With a test environment like that, I would simply write the code I wanted to test, copy it ten times and put it in a textmacro, then run that textmacro a bunch of times in the periodic function. Then, I would keep changing the number of times I run the textmacro and/or repeat the loop and testing the map until I would get a specific FPS (20fps in my case). Once that was reached, I would count the number of times I ran the code and document that. Here are the results:
  • Collapse JASS:
        call Test()
    A simple function call. As is shown in the full code above, I ran this 20000 times per update to get the target FPS. When compared to other things that I tested later, simple function calls turn out to be quite fast, so I decided to make this my basic benchmarking unit, one function call or FC for short. If someone else repeated my tests on different hardware or with a different target FPS, they would get a different number of code executions they can do per update, but if they then calculated the FC values, those should match mine unless function calls have different resource costs on different hardware.
  • Collapse JASS:
        call Test2(1)
    A function call with an integer argument is the next thing I decided to test. This is equivalent to calling a non-static method without arguments. Note that in most cases, you would use a variable or even an array as the argument, which is presumably more costly than a simple static value, but is not a part of this benchmark. As expected, adding an argument increases the cost of the function call: I could only run 12000 at the same FPS, which makes the cost of a function call with one argument 1.67 FC.
  • Collapse JASS:
        call Test3(1,2,3,4,5)
    The next step, a function with multiple arguments, five in this test case. I could only do 4400 of these calls, which makes their cost 4.55 FC. That puts the cost of each argument at about 0.7 FC, which matches reasonably well with the previous test.
  • Collapse JASS:
        call Test.execute()
    Compared to function calls, .execute is very slow at only 350 instances or a cost of about 57 FC.
  • Collapse JASS:
        call Test.evaluate()
    Evaluate doesn't fare much better at 550 instances or a cost of 36 FC. I just realized that I was testing this in debug mode, where JassHelper doesn't do inlining. If I tested this outside debug mode, the cost should be reduced by 1 FC, which isn't very significant.
  • Collapse JASS:
        call Test3.execute(1,2,3,4,5)
    Executing a function with arguments is even slower, 230 instances give a cost of 87 FC, but note that unlike an .execute without arguments which is 57 times slower than a corresponding function call, this is only 19 times slower than calling Test3.
  • Collapse JASS:
        set i=i+1
    Next I decided to test a common operation which involves a variable read, an integer addition and a variable set. I could do about 15500 such operations per timer update, which gives a cost of 1.29 FC, however, I then realized that I had made the global integer i a private variable, which means its name was something like test5_i. Since longer variable names are slower than shorter ones, I decided to repeat the test and could get 16500 operations on the second try, giving a cost of 1.21 FC. I also tried changing i to a local variable, but there was no significant difference in performance.
  • Collapse JASS:
        call SetUnitX(u,0.0)
        call SetUnitY(u,0.0)
    This is the next thing I tried, another operation we often do in fast periodic triggers. I could run about 2000 instances of this, so the cost is roughly 10 FC. However, this is hardly all that is involved in moving projectiles or knocking units back, so I tested some slightly more complicated examples:
  • Collapse JASS:
        set a=a+0.05
        if a>bj_PI*2.0 then
            set a=0.0
        endif
        set x=Cos(a)*128.0
        set y=Sin(a)*128.0
        call SetUnitX(u,x)
        call SetUnitY(u,y)
    A simple cyclic example of unit movement, this can be run about 800 times which gives us a cost of 25 FC. 10 of that is used by SetUnitX/Y, so we see that even the simple maths involved in moving units is not insignificant. However, this is not the most typical example, we don't always need to do trigonometry and perhaps that is the most costly part, so we do another test:
  • Collapse JASS:
        set x=x+0.05
        set y=y+0.02
        if x>500.0 then
            set x=0.0
        endif
        if y>500.0 then
            set y=0.0
        endif
        call SetUnitX(u,x)
        call SetUnitY(u,y)
    This does somewhat better, 1200 instances or a cost of 16.7 FC. Compared to 10 FC needed for SetUnitX/Y, the overhead here is much smaller. As I was doing this test, though, I realized it was still not very representative. We usually work with structs, which means we don't use global variables, we use global arrays. Time to tweak our test case some more:
  • Collapse JASS:
        set X[N]=X[N]+0.05
        set Y[N]=Y[N]+0.02
        if X[N]>500.0 then
            set X[N]=0.0
        endif
        if Y[N]>500.0 then
            set Y[N]=0.0
        endif
        call SetUnitX(U[N],X[N])
        call SetUnitY(U[N],Y[N])
    Doing this brings us down to 880 instances or 22.7 FC. But wait, I made all my variables private by mistake again; once I remove the private keyword and thus make the variable names as short as possible, I can get nearly 1000 instances or 20 FC.
  • Collapse JASS:
            if 1==1 then
            endif
    This is one more thing I tried testing, but I hit the op limit around 50000 instances before hitting 20fps, so I threw in a thousand SetUnitX/Y calls to use up half of my available processing power. With the rest, I could get 34000 instances of this if statement, which gives a total of 68000 instances and a cost of 0.3 FC. That's pretty fast, but all I was testing was an if statement and an integer comparison. Let's try something bigger:
  • Collapse JASS:
            if X[N]>=0.0 then
            endif
    Here we go, global read, array read, real comparison and an if statement, this brings our instance count down to 20000, so the cost is roughly 1 FC.
  • Collapse JASS:
            call TriggerEvaluate(T)
    Adding multiple conditions to one trigger is supposed to be significantly faster than evaluating multiple triggers with one condition. Let's test this:
    Results:
    number of conditions      1      2      3      5      10
    number of instances       650    370    260    165    85
    cost (FC)                 30.8   54.1   76.9   121.2  235.3
    cost per cond. (FC)       30.8   27.1   25.6   24.2   23.5
    There is a difference, but it is not very significant. The true cost of a trigger condition is around 22.7 FC with an overhead of 8.1 FC for evaluating the trigger.

    The main thing of interest for me is to compare this to the cost of an .evaluate, which is 36 FC. Why is there an overhead of 5 FC when .evaluate only calls a single wrapper, which should cost only 1 FC? I suspect the rest of the overhead comes from JassHelper using an array with a long name, as opposed to my variable T. Perhaps the length of the function name also affects results? My first test was done with a function that had a name 10 characters long. Let's try other lengths:
  • Collapse JASS:
            call ThisIsAFiveTimesLongerFunctionNameThan__TestX_Test()
    At 11500 instances, the cost of calling this function is 1.74 FC, quite a difference. However, when I tested the same thing with a function name 1 character long, there was hardly any difference compared to 10 characters. I could run about 21200 instances, but when I repeated the first test more accurately it turned out I could get a bit over 20000 instances there as well, so the difference is negligible. The length of a function name appears to make a difference only when it's very long.
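The per-condition cost and trigger overhead quoted for the TriggerEvaluate test can be reproduced with a simple two-point estimate. The post does not say how those two figures were derived, so using the 1-condition and 10-condition results here is an assumption; it simply happens to give the same numbers. A minimal Python sketch (the function name is made up for illustration):

```python
# Costs in FC from the TriggerEvaluate results (number of conditions -> cost).
costs = {1: 30.8, 2: 54.1, 3: 76.9, 5: 121.2, 10: 235.3}

def two_point_fit(costs, lo, hi):
    """Estimate cost = overhead + per_condition * n from two measurements."""
    per_condition = (costs[hi] - costs[lo]) / (hi - lo)
    overhead = costs[lo] - per_condition * lo
    return per_condition, overhead

# Assumption: use the smallest and largest condition counts.
per_condition, overhead = two_point_fit(costs, 1, 10)
print(round(per_condition, 1), round(overhead, 1))
```

The intermediate measurements are not used by this estimate, so they serve as a rough sanity check on how linear the cost really is.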

The above examples were all very theoretical, even when I tried to simulate practical uses I could only speculate about what users would do in their periodic triggers, so I decided to do a much more practical stress test, using an actual projectile library. I tested xemissile, a rather well optimized script designed specifically for simulation of WC3 missiles and nothing more. This is the simplest script I could think of that actually does something useful. A simple knockback system could perhaps be a bit less costly, but that's about it. I was only testing running costs, not allocation, so I simply created a bunch of very slow missiles when the map started and then watched the FPS as they moved towards their target:
Expand JASS:
I could run either 105 xemissiles or 60 xehomingmissiles. Keep in mind that I was testing this with XE_ANIMATION_PERIOD set to 0.01 so that I could compare it to the rest of my tests, in practice you can have a lot more missiles than that. An xemissile therefore costs 190 FC if it has a static target and 333 FC if it has a moving target.

That's all I have so far. You're welcome to make your own tests similar to these, but before you use the FC unit to measure processing costs please first repeat some of my tests and check if you get the same ratios between results in order to confirm that FC is a reliable way to measure this.
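For anyone repeating the tests, the conversion from an instance count to FC is just the reference count divided by the measured count. A short Python sketch of that arithmetic, using counts quoted above (the helper name is hypothetical; on other hardware the reference count of 20000 would be replaced by whatever the empty function call test yields):

```python
REFERENCE_CALLS = 20000  # empty function calls per timer tick at the target FPS

def fc_cost(instances, reference=REFERENCE_CALLS):
    """Cost of one instance of the tested code, in function calls (FC)."""
    return reference / instances

# Instance counts quoted in the post above.
measured = {
    "call Test2(1)": 12000,
    "call Test3(1,2,3,4,5)": 4400,
    "call Test.execute()": 350,
    "call Test.evaluate()": 550,
}
for code, count in measured.items():
    print(f"{code}: {fc_cost(count):.2f} FC")
```

If FC is a reliable unit, the values computed this way on a different machine should come out close to the ones listed here even though the raw counts differ.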
09-19-2011, 02:13 PM#2
BBQ
I would like to request a "newer" benchmark of array reading against hash table reading. If the hash table turns out to be very slow, then I would also like to see a comparison between 2D arrays and hash tables.

Furthermore, you could test if the read/write speed of a hash table depends on the amount of objects stored within it (it shouldn't, but still...).

I would do those benchmarks myself, but I have no WC3 at the moment.
09-20-2011, 02:31 PM#3
Anitarf
I did the benchmarks you requested. Since you can't have a variable read on its own, I decided to test it as part of an if statement, since I have already done some such tests before. Note that I was comparing practical costs, so each array read also involves one variable read and each hashtable read involves three variable reads (one for the hashtable, two for the keys). My test cases thus pretty much represent what Table inlines and optimizes to. Here are my results:
Table:
code:                                   count:   cost:     read cost:
if 1==1 then endif                      68000    0.3 FC
if X[N]>=0.0 then endif                 19200    1.07 FC
if I[N]>=0 then endif                   18000    1.11 FC   0.81 FC
if I[N+5*N]>=0 then endif               9500     2.11 FC   1.81 FC
if LoadInteger(H,N,N)>=0 then endif     5700     3.51 FC   3.21 FC
if LoadReal(H,N,N)>=0.0 then endif      5600     3.57 FC   3.27 FC
if LoadInteger(H,N,N)>=0.0 then endif   5500     3.64 FC   3.34 FC

Comments:
  • I was surprised to find an integer array read and a comparison slower than a real array read and a comparison, since if I had to guess I'd say they're either equally fast or integers are faster. Not sure what causes this difference between integers and reals, the array read or the comparison. Both arrays had the same value (5 and 5.0 respectively) set at that index, so it's not a problem with reading uninitialized values or anything like that. I switched between the two tests multiple times and got consistent results, although the difference is a mere 0.03 FC which is insignificant, so if anyone goes around saying how reals are faster than integers based on this I'll have to slap them.
  • On the other hand, integers are faster when doing a hashtable read plus comparison, so again I don't know what to make of this. The difference is still insignificant (0.06 FC) so I'm tempted to say this is just a random fluctuation because these tests are not accurate but really, if I tried the same number of instances for both test cases, I would get significantly different FPS. I have to point out again that although there really does seem to be a difference between the two, it is not significant at all.
  • The last test I did was a real comparison where one of the compared values is an integer; in this case, the engine has to typecast the integer to a real before doing the comparison. Apparently, this is done very very quickly because the difference to a regular real hashtable read and comparison is as insignificant as the differences I was discussing before.

Conclusions:
  • After subtracting the cost of the if statement itself (0.3 FC), an array read costs about 0.8 FC, a simulated 2D array read costs 1.8 FC and a hashtable read costs 3.2 FC, which is 4 times the cost of an array read and about 1.8 times the cost of a simulated 2D array read.
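The read costs in the conclusion come from subtracting the bare if statement from each total. A small Python sketch of that subtraction, using the integer test cases from the table (the dictionary keys are labels chosen for this example):

```python
IF_COST = 0.3  # FC, cost of the bare "if 1==1 then endif" test

# Total costs (FC) of the integer test cases from the table.
total_cost = {
    "array": 1.11,      # if I[N]>=0 then endif
    "2d_array": 2.11,   # if I[N+5*N]>=0 then endif
    "hashtable": 3.51,  # if LoadInteger(H,N,N)>=0 then endif
}

# Read cost = total cost minus the cost of the if statement and comparison.
read_cost = {name: round(cost - IF_COST, 2) for name, cost in total_cost.items()}
ratio_vs_array = read_cost["hashtable"] / read_cost["array"]
ratio_vs_2d = read_cost["hashtable"] / read_cost["2d_array"]
print(read_cost, round(ratio_vs_array, 1), round(ratio_vs_2d, 1))
```

Note that the 0.3 FC baseline was measured with an integer comparison of two constants, so treating it as the fixed cost of every if statement in the table is itself an approximation.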
10-12-2011, 10:34 PM#4
Anitarf
I did some more tests. I decided to test different methods of running code for units in an area. As it turns out, now that group enumerations with a null filter no longer leak, FirstOfGroup based loops perform the best.

Collapse Initialization:
    private function Init takes nothing returns nothing
        local integer n=20
        loop
            call CreateUnit(Player(0), 'hfoo', n*50.0-500.0,0.0,0.0)
            set n=n-1
            exitwhen n==0
        endloop
        set b=Filter(function F1)
        call TimerStart(CreateTimer(), 0.01, true, function Periodic)
    endfunction
Collapse EnumFilter loop:
    function F1 takes nothing returns boolean
        set u=GetFilterUnit()
        return false
    endfunction

    //! textmacro FilterLoop
        call GroupEnumUnitsInRange(g, 0.0,0.0,1000.0, b)
    //! endtextmacro 
Collapse ForGroup loop:
    function F2 takes nothing returns nothing
        set u=GetEnumUnit()
    endfunction

    //! textmacro ForGroupLoop
        call GroupEnumUnitsInRange(g, 0.0,0.0,1000.0, null)
        call ForGroup(g, function F2)
    //! endtextmacro
Collapse FirstOfGroup loop:
    //! textmacro FirstOfGroupLoop
        call GroupEnumUnitsInRange(g, 0.0,0.0,1000.0, null)
        loop
            set u=FirstOfGroup(g)
            exitwhen u==null
            
            call GroupRemoveUnit(g, u)
        endloop
    //! endtextmacro
I chose a group of 20 units as my test case. I adjusted the number of times each textmacro was repeated in the Periodic function so that I got the same frames per second as in my reference case of 20000 empty function calls.
Results:
                count   FC cost   per unit
Enum filter     31      645 FC    32 FC
ForGroup        28      714 FC    36 FC
FirstOfGroup    75      266 FC    13 FC
The results indicate that group enum filters and ForGroup callbacks are about as costly as trigger evaluations, which makes them rather slow. A filter-less enum coupled with a FirstOfGroup loop is about 2.4 times faster than the enum filter and 2.7 times faster than ForGroup, so almost three times faster.

I did one more batch of tests. I compared how many FirstOfGroup loops I could do for different numbers of units in the area.
Results:
number of units           1     2     3     5     10    20
number of instances       630   450   355   245   140   75
cost (FC)                 32    44    56    82    142   267
cost per unit (FC)        32    22    19    16    14    13
Based on these results, the overhead of enumerating units in an area and then doing a FirstOfGroup loop is 19.7 FC and each unit that ends up in the group costs an additional 12.3 FC.
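One way to check the overhead and per-unit figures is an ordinary least-squares line through the six measurements. This is a sketch under the assumption that the cost is linear in the number of units; the post does not state how its two figures were obtained, and the fit lands close to them rather than exactly on them (roughly 12.4 FC per unit and 19.3 FC of overhead).

```python
# FirstOfGroup results: units in range -> cost in FC.
units = [1, 2, 3, 5, 10, 20]
cost = [32, 44, 56, 82, 142, 267]

def least_squares(xs, ys):
    """Ordinary least-squares line ys ~ intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

per_unit, overhead = least_squares(units, cost)
print(f"per unit: {per_unit:.1f} FC, overhead: {overhead:.1f} FC")
```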


I also did a test of only doing a ForGroup on a group of 20 units without enumeration. I could run the loop 32 times, which is only marginally more than the 28 I was able to get with the added GroupEnum. The cost of a ForGroup is thus 625 FC or 31 FC per unit. By subtraction, the cost of the GroupEnum is just about 5 FC per unit.

I compared that to a loop through a linked list of units. I could initially do 80 loops through a list of 20 units, but once I tweaked the loop methods on my struct so they would inline properly, I could do 140 loops, which is nearly 5 times the number of ForGroup loops I could do. Of course, a linked list is optimized for looping, but isn't as good for some of the other operations you can do on groups.

During the course of testing, I occasionally repeated my reference test of 20000 function calls on a function with no arguments and a name 10 characters long. As I was testing different numbers of units, I noticed that the target FPS had changed, so the tests I did on the FirstOfGroup loops with different numbers of units were not as accurate. How big was the error? I ran some tests on an empty map and a map with 50 footmen. To get the same FPS, I needed to reduce the number of function calls by 3% when I was testing with 50 stationary units on the map. This isn't such a big deal; in my tests I had 1 to 20 units, so my results are still fairly accurate. For fun, I also tested what happened when I ordered those 50 footmen to patrol to a nearby point. The reduction in the number of function calls needed to maintain the same FPS was 5%, so those 50 patrolling footmen cost me 1000 FC.
10-13-2011, 09:10 PM#5
Anitarf
I recently stumbled upon the following statement:
Quote:
Also, in the exitwhen statement, 0==this is faster than this==0.
I was skeptical about this, I didn't really see a reason for why there should be a difference between the two, so I decided to test it.

I tested this claim with the following two test cases:
Collapse Case 1:
        loop
            exitwhen N==0//0==N
        endloop
Collapse Case 2:
        set N=10
        loop
            set N=N-1
            exitwhen N==0//0==N
        endloop
Results:
                instances   cost per instance
Case 1, N==0    28000       0.7 FC
Case 1, 0==N    28000       0.7 FC
Case 2, N==0    1050        19 FC
Case 2, 0==N    950         21 FC
The results of the second test case actually suggest that 0==N is slower, but the difference is so small that it's probably due to how the instructions get sorted on my particular CPU, rather than a universal speed difference. I tried changing the order of the exitwhen statement and N=N-1 and got the same results. Considering how small the difference is and the fact that there was no difference on the first test case, the only thing I can reasonably conclude from this is that N==0 and 0==N are equally fast.
10-13-2011, 11:32 PM#6
Fledermaus
How hard is it to actually update the stopwatch natives? Seems like this patch is gonna be the current one for a while.
10-16-2011, 04:11 AM#7
PurgeandFire111
Quote:
Originally Posted by Anitarf
I recently stumbled upon the following statement:
I was skeptical about this, I didn't really see a reason for why there should be a difference between the two, so I decided to test it.

Thank you lol. I never understood the logic of that statement. In a logical sense, they would be equal in terms of speed, which apparently is so.

Quote:
How hard is it to actually update the stopwatch natives? Seems like this patch is gonna be the current one for a while.

I think the source code/dl link of the stopwatch natives is down so it would probably be difficult to do unless someone gets in contact with SFilip.
10-16-2011, 11:07 AM#8
Bannar
Anitarf, could you once again benchmark GroupEnum + filter (performing actions) against GroupEnum + FirstOfGroup enumeration, but under different circumstances?
What I mean is moving those closer to a real environment, where the user uses multiple comparisons (for unit selection), and with a higher unit count. Maybe even with a unit-type mix.

I'm just unsure of the current results since those are 'perfect circumstances' which, to be honest, never happen. JASSers, while writing spells/systems, usually implement a ton of selective stuff to consider all the options, making their work bug-free.

Thanks in advance.
10-16-2011, 05:22 PM#9
Anitarf
Quote:
Originally Posted by Inferno
Anitarf, could you once again benchmark GroupEnum + filter (performing actions) against GroupEnum + FirstOfGroup enumeration, but under different circumstances?
What I mean is moving those closer to a real environment, where the user uses multiple comparisons (for unit selection), and with a higher unit count. Maybe even with a unit-type mix.
The comparisons shouldn't matter since they need to be done for each unit regardless of whether you're doing those comparisons in a filter or in a FirstOfGroup loop. So no matter how much code you add to the loop, the cost of the loop increases by the same amount for both looping methods, so the absolute difference in speed between the two approaches remains the same.
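To illustrate the point with the per-unit figures from the 20-unit test: adding the same amount of selection code to both loops leaves the absolute gap untouched, but the ratio between the two shrinks. The amounts of extra work below are hypothetical values picked for illustration, not measurements.

```python
# Per-unit loop costs (FC) from the 20-unit test.
ENUM_FILTER = 32
FIRST_OF_GROUP = 13

def speedup(extra_per_unit):
    """Ratio of filter-loop cost to FirstOfGroup-loop cost when both
    run the same extra_per_unit FC of selection code for every unit."""
    return (ENUM_FILTER + extra_per_unit) / (FIRST_OF_GROUP + extra_per_unit)

for extra in (0, 6, 20):  # hypothetical amounts of per-unit work, in FC
    gap = (ENUM_FILTER + extra) - (FIRST_OF_GROUP + extra)  # always 19 FC
    print(extra, gap, round(speedup(extra), 2))
```

Under these assumptions, about 6 FC of shared per-unit work is enough to bring the ratio down from roughly 2.5 to exactly 2, while the gap stays at 19 FC per unit.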
10-16-2011, 07:28 PM#10
Bannar
Anitarf, I somewhat agree that this situation is similar to the "N==0 and 0==N" issue, although I started to be unsure again after reading random responses. And if you scroll down a bit to Bribe's post, you can see that he mentions the speed advantage of FirstOfGroup over GroupEnum + filter dropping to two times instead of three.

What I thought is that the more filter code we add (selection stuff), the smaller the difference between GroupEnum/FoG and GroupEnum/filter becomes, although it still ends up being two times faster.

You know, those benchmarks have greatly changed the view on some major things, especially group enumeration methods.
10-16-2011, 09:28 PM#11
Anitarf
Quote:
Originally Posted by Inferno
Anitarf, I somewhat agree that this situation is similar to the "N==0 and 0==N" issue, although I started to be unsure again after reading random responses. And if you scroll down a bit to Bribe's post, you can see that he mentions the speed advantage of FirstOfGroup over GroupEnum + filter dropping to two times instead of three.
Yes, he did mention that, but he didn't describe his methodology, he didn't post any test code or raw test results nor any reference tests, so I can't really know what he was measuring in the first place. Did he test raw loops like I did or did he add some unit filters to the loop like others have suggested? I can't know that. Maybe he did do a much better test than I, or maybe he just pulled that number out of his ass.

As for other comments in that thread, yes, I did only do these tests on one computer, it's the only one I have. I'm not sure what people who object to this expect me to do, go out and buy a new computer so I can do more WC3 benchmarks? I don't think so. When I first posted this thread I did invite others to replicate my tests on different hardware so if you feel my results are suspect then by all means, repeat my tests. If you don't then you don't get to complain.

The comment about the op limit may be a valid concern, but there are other ways of avoiding it besides moving some code to enum filters. For example, in a missile system, you could run the periodic update function of every missile that does collision checks with an .evaluate instead of a regular function call. The cost of doing so will easily be offset by using the faster FirstOfGroup loop as long as there are enough units in range (which there should be since the whole justification for why the op limit might be a problem was that missile collision checks could be enumerating large amounts of units).
10-26-2011, 11:00 PM#12
Anitarf
I ran some more tests to verify the feasibility of dummy unit recycling. First, I wrote a simple dummy unit recycling library that maintains multiple queues of dummies at different facing angles so that the dummies are suitable for use as projectiles:
Collapse JASS:
library xedummy requires xebasic

    globals
        private constant integer ANGLE_RESOLUTION = 12
        private constant integer DUMMY_PRELOAD_COUNT = 5 // Not yet implemented.
    endglobals

// END OF CALIBRATION SECTION
// ================================================================

    private struct recycleQueue extends array
        recycleQueue next
        recycleQueue prev

        real angle

        integer size
        xedummy first
        xedummy last
        static method onInit takes nothing returns nothing
            local integer i=0
            loop
                exitwhen i==ANGLE_RESOLUTION
                set i=i+1
                set recycleQueue(i).prev=recycleQueue(i-1)
                set recycleQueue(i).next=recycleQueue(i+1)
                set recycleQueue(i).angle=(i-0.5)*(360.0/ANGLE_RESOLUTION)
            endloop
            set recycleQueue(1).prev=recycleQueue(i)
            set recycleQueue(i).next=recycleQueue(1)
        endmethod

        static method get takes real angle returns recycleQueue
            return recycleQueue(R2I(angle/360.0*ANGLE_RESOLUTION)+1)
        endmethod
    endstruct

// ================================================================

    struct xedummy
        private static group g=CreateGroup()
        private unit u

        // ----------------------------------------------------------------

        private xedummy next
        
        private method queueInsert takes recycleQueue q returns nothing
            call SetUnitFacing(.u, q.angle)

            if q.size==0 then
                set q.first=this
            else
                set q.last.next=this
            endif
            set q.last=this
            set .next=0

            // Check adjacent queues to see if they have fewer dummies than this queue.
            // If they do, move the first dummy from this queue to an adjacent queue.
            // This operation is recursive so it will find the local minimum queue and
            // increase the size of that rather than a neighbouring larger queue.
            if q.size>q.next.size then
                set this=q.first
                set q.first=.next
                call .queueInsert(q.next)
            elseif q.size>q.prev.size then
                set this=q.first
                set q.first=.next
                call .queueInsert(q.prev)
            else
                set q.size=q.size+1
            endif
        endmethod
        
        private static method queueRemove takes recycleQueue q returns xedummy
            // Check adjacent queues to see if they have more dummies than this queue.
            // If they do, move the first dummy from that queue to this one.
            // This operation is recursive so it will find the local maximum queue and
            // decrease the size of that rather than a neighbouring smaller queue.
            local xedummy this
            if q.size<q.next.size then
                set this=q.last
                set q.last=.queueRemove(q.next)
                set .next=q.last
                call SetUnitFacing(q.last.u, q.angle)
            elseif q.size<q.prev.size then
                set this=q.last
                set q.last=.queueRemove(q.prev)
                set .next=q.last
                call SetUnitFacing(q.last.u, q.angle)
            else
                set q.size=q.size-1
                if q.size==0 then
                    set q.last=0
                endif
            endif

            set this=q.first
            set q.first=.next
            set .next=0
            return this
        endmethod
    
        // ----------------------------------------------------------------

        private static method create takes unit u returns xedummy
            local xedummy this
            if GetUnitTypeId(u)!=XE_DUMMY_UNITID then
                debug call BJDebugMsg("xedummy.release error: Method called on a unit of an incorrect type.")
            elseif IsUnitInGroup(u, .g) then
                debug call BJDebugMsg("xedummy.release error: Method called on an already released unit.")
            else
                set this=.allocate()
                set .u=u
                call GroupAddUnit(.g, u)
                call .queueInsert(recycleQueue.get(GetUnitFacing(u)))
                call ShowUnit(u, false)
                return this
            endif
            return 0
        endmethod

        private method destroy takes nothing returns nothing
            call GroupRemoveUnit(.g, .u)
            call ShowUnit(.u, true)
            set .u=null
            call .deallocate()
        endmethod

        // ----------------------------------------------------------------

        private static unit dummy
        static method new takes player p, real x, real y, real face returns unit
            local recycleQueue q
            local xedummy this
            loop
                exitwhen face>0.0
                set face=face+360.0
            endloop
            loop
                exitwhen face<360.0
                set face=face-360.0
            endloop
            set q=recycleQueue.get(face)
            if q.size==0 then
                set .dummy = CreateUnit(p, XE_DUMMY_UNITID, x,y,face)
                call UnitAddAbility(this.dummy,XE_HEIGHT_ENABLER)
                call UnitAddAbility(this.dummy,'Aloc')
                call UnitRemoveAbility(this.dummy,XE_HEIGHT_ENABLER)
            else
                set this=.queueRemove(q)
                set .dummy=.u
                call .destroy()
                call SetUnitX(.dummy, x)
                call SetUnitY(.dummy, y)
                call SetUnitFacing(.dummy, face)
                call SetUnitOwner(.dummy, p, true)
            endif
            return .dummy
        endmethod

        static method release takes unit u returns nothing
            call .create(u)
        endmethod
    endstruct

endlibrary
  1. In the first test, I simply compared how many units I could create/recycle on a periodic 0.01 second timer to get the same fps drop as calling a function 20000 times, same as in my earlier tests. I didn't test only unit creation, I had to also include unit removal, else I wouldn't get stable results as the number of units in the map would increase rapidly. In practice, we also have to eventually remove all the units we create so that's all right.
    Table:
    Code: | Count: | Cost:
    call RemoveUnit( CreateUnit(Player(15), XE_DUMMY_UNITID, 0.0,0.0,0.0) ) | 13 | 1538 FC
    call xedummy.release( xedummy.new(Player(15), 0.0,0.0,0.0) ) | 57 | 351 FC
    We can see from these results that creating (and removing) a unit is a rather costly operation and that recycling it is roughly 4.4 times faster (1538 FC versus 351 FC), possibly more once the map is optimized.


  2. For my second test, I edited xefx to use the library above instead of creating and removing dummy units, then compared how many xemissiles I could run with the edited xefx versus the original. I used the following test code:
    Collapse JASS:
    library test initializer onInit requires xemissile
        private function Periodic takes nothing returns nothing
            // create a missile with a random target
            local real x=GetRandomReal(-1024,1024)
            local real y=GetRandomReal(-1024,1024)
            local real tx=GetRandomReal(-1024,1024)
            local real ty=GetRandomReal(-1024,1024)
            local xemissile m=xemissile.create(x,y,0,tx,ty,0)
            // calculate the flight distance
            set tx=tx-x
            set ty=ty-y
            set x=SquareRoot(tx*tx+ty*ty)
            // launch with such speed that the distance will be covered in one second
            call m.launch(x, 0.15)
            //set m.fxpath="Abilities\\Weapons\\LordofFlameMissile\\LordofFlameMissile.mdl"
        endfunction
    
        private function onInit takes nothing returns nothing
            call TimerStart(CreateTimer(), 0.01, true, function Periodic)
        endfunction
    endlibrary
    At the start, I created 100 xemissiles per second; each missile lasted 1 second, which seemed like a reasonably common value based on my experience from ET. The xe update period was set to 0.025, so each missile went through 40 update cycles in addition to the costs of being created and destroyed. To get the same FPS with the unedited xefx, I had to increase my test timer period to 0.0125, which means the number of missiles created per second dropped from 100 to 80. Using dummy unit recycling thus allowed me to get 25% more missiles for the same cost. Keep in mind, though, that updating xemissiles is relatively cheap, so recycling the dummy units would make less of a difference in a more complicated missile system, since those 40 updates would represent a larger proportion of the total cost.
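For reference, the loop behind the first test can be sketched roughly as follows. This is only a minimal sketch, not the exact code that produced the numbers above: the scope name and the COUNT constant are made up for illustration, and it assumes xebasic is in the map so that XE_DUMMY_UNITID is available.

```jass
scope RecycleBench initializer Init

    globals
        // Hypothetical knob: adjust until the FPS drop matches the
        // reference run that calls an empty function 20000 times.
        private constant integer COUNT = 13
    endglobals

    private function Periodic takes nothing returns nothing
        local integer i = COUNT
        loop
            exitwhen i <= 0
            // The code under test: create a dummy and remove it again,
            // so the number of units on the map stays stable.
            call RemoveUnit(CreateUnit(Player(15), XE_DUMMY_UNITID, 0.0, 0.0, 0.0))
            set i = i - 1
        endloop
    endfunction

    private function Init takes nothing returns nothing
        // Same 0.01 second period as the reference run.
        call TimerStart(CreateTimer(), 0.01, true, function Periodic)
    endfunction

endscope
```

Once both runs show the same FPS, the cost in function calls is simply 20000 divided by the count, e.g. 20000/13 ≈ 1538 FC for the create/remove pair. To test the recycler instead, the line inside the loop would be swapped for the xedummy.release( xedummy.new(...) ) call from the table.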
10-30-2011, 12:35 PM#13
cohadar
Table:
if LoadReal(H,N,N)>=0.0 then endif | 5600 | 3.57 FC | 3.27 FC
if LoadInteger(H,N,N)>=0.0 then endif | 5500 | 3.64 FC | 3.34 FC

The second statement is slower because it has an implicit conversion done by the compiler:
if LoadInteger(H,N,N)>=0.0 then ==> if I2R(LoadInteger(H,N,N))>=0.0 then

You should have used integer zero instead of real zero:
if LoadInteger(H,N,N)>=0 then
Quote:
Originally Posted by Anitarf
so if anyone goes around saying how reals are faster than integers based on this I'll have to slap them.
I think reals could really be faster than integers, I think we should all recode all our stuff to use reals just to be sure.
10-31-2011, 05:29 AM#14
Nuclear Arbitor
excellent suggestion, you get right on it
11-01-2011, 11:04 PM#15
Anitarf
Following up on my previous post, I decided to make a comparison between the dummy recycling library I wrote above and MissileRecycler, which was submitted recently by Bribe. Since Bribe uses timers and I don't, I was expecting his library to be slightly slower, but not significantly.

Before I could benchmark the libraries, though, I had some tweaking to do. Since MissileRecycler's units did not immediately become available for recycling, I had to preload enough of them for my stress test. To avoid creating too many units, I reduced the number of directions to 3. This is a useless number in practice, but the speed of recycling shouldn't really change with a higher number, so my results are valid. I preloaded 70 units per direction, so I had 210 units on my map in total.

During testing, I was surprised that my library was performing significantly worse than Bribe's. It turned out that the culprit was the ShowUnit native: I was hiding my units on the recycle queue and Bribe wasn't. Since there's really no need to hide the dummy units, I removed the ShowUnit calls from my library and repeated the test.

Table:
Code: | Count: | Cost:
call RemoveUnit( CreateUnit(Player(15), XE_DUMMY_UNITID, 0.0,0.0,0.0) ) | 12.5 | 1600 FC
call RecycleMissile( GetRecycledMissile( 0.0,0.0,0.0,0.0 ) ) | 60 | 333 FC
call xedummy.release( xedummy.new(Player(15), 0.0,0.0,0.0) ) with ShowUnit | 42 | 476 FC
call xedummy.release( xedummy.new(Player(15), 0.0,0.0,0.0) ) | 68 | 294 FC
The results are as expected, so I wouldn't really have bothered posting this if not for the interesting observation that hiding and showing a unit, even one with the 'Aloc' ability, is quite a costly operation: about 180 FC (476 FC with ShowUnit versus 294 FC without).
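
The 180 FC figure is inferred from the difference between two rows of the table. A direct measurement of the hide/show pair could look something like the sketch below. I did not run this as part of the tests above; the scope name, the COUNT constant and its starting value are hypothetical, and it again assumes xebasic is present for XE_DUMMY_UNITID.

```jass
scope ShowUnitBench initializer Init

    globals
        // Hypothetical starting value; tune until the FPS drop matches
        // the reference run of 20000 empty function calls.
        private constant integer COUNT = 100
        private unit dummy
    endglobals

    private function Periodic takes nothing returns nothing
        local integer i = COUNT
        loop
            exitwhen i <= 0
            // One hide/show pair, as the recycler did per recycled unit.
            call ShowUnit(dummy, false)
            call ShowUnit(dummy, true)
            set i = i - 1
        endloop
    endfunction

    private function Init takes nothing returns nothing
        // A single dummy set up the same way the recycling library does it.
        set dummy = CreateUnit(Player(15), XE_DUMMY_UNITID, 0.0, 0.0, 0.0)
        call UnitAddAbility(dummy, 'Aloc')
        call TimerStart(CreateTimer(), 0.01, true, function Periodic)
    endfunction

endscope
```

If the estimate holds, the matching count should land somewhere around 110, since 20000/180 ≈ 111.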