Ticket #185 (new defect)

Opened 8 months ago

Last modified 3 weeks ago

The 0x07 Packet (Ordered to DC)

Reported by: Maverick Assigned to: nobody
Priority: high Component: Core
Version: Latest version from repository Severity: critical
Keywords: Cc:

Description

Updated RoboQueen today and noticed that RoboBoy was still getting kicked.

30 Dec 2007 19:19:31   RoboBoy (purepubbot) is disconnecting: received bi-direct
ional packet 0x07 from server (ordered to DC)

I thought this was caused because the bot is sending too many packets but wasn't the limit lesser when hosted on the same server?

On RoboQueen's previous bots, was this packet handled at all?
Wouldn't we save alot of trouble if we do not handle this packet at all?

Attachments

Change History

12/31/07 10:40:31 changed by dugwyler

Ahh, didn't yet put the code in for that particular option. I'm adding it in now, though (will wait to commit until you respond to this comment though, considering one choice we have to make). It's odd that it would be an issue: the default limit is 100 packets per 75ms clustered send, and as you said it should be able to send them out fairly quickly from a local connection. Maybe there's still a limit, though. Did the PurePubBots randomly disconnect on occasion in the past?

The packet has always been handled, because if the server sends the order, I'm pretty sure afterward it will ignore all data coming from that client until it sees that it's disconnected. The two instances in which this occurs, to my knowledge, are when the server is not receiving its required ACKs and various required responses, and when a packet flood is detected. Not sure which one it is in this case — or if there might be a third case that we can't figure out. To make sure the new code wasn't blocking any important data being sent out, I checked which methods call sendChatPacket (which is the only method that adds low priority/capped packets), and it should be safe there.

One possibility might be that local connections have the exact same upper packet limit as others, and are just as likely to get kicked as remote connections. Of course they have no latency or loss, so there's still an advantage to hosting bots locally if this is the case, with the chance of them timing out being reduced. (In fact some of the King crashes we see may be due to the entire botcore just lagging out.) I do remember Myth saying that he implemented the message spam limiter for MatchBot — but then MatchBot runs (and presumably at that time, ran) on Queen, so it may still be necessary/useful.

Assuming this may be the case we could implement a default message limit of 40 chat packets per 75ms clustered send. This seems to work with DistensionBot.

01/02/08 01:09:29 changed by Maverick

I believe local connections have the same limit as others and are just as likely to get kicked as remote connections. If there would be no limit then there is not much reasoning as to why the server would request the purepubbbot and matchbots to disconnect.

Wondering: the server gives off a server error when a packet flood is detected from normal players. However, when bots disconnect there is no such server error given. This might simply be because the bots are sysops and therefore no errors are given or is there something else going on?

Wouldn't limiting messages (talking about a default message limit) unnecessary slow down bots as we're still unsure if PMs are causing it?
Former Enigmabot was/is also experiencing the same problems as it made the server recycle (just like Distensionbot), however this bot doesn't send alot of PMs. It does *objons and changes arena settings when the game is run.
Hmmm is TWCore sending *objons as PMs? If so, this might be improved if we adopt Mervbot's way of toggling LVZs - it does this by sending specific LVZ toggling packets to the server instead of a PM.

I've run over the changesets done under twcore.core and found the following "recent" changesets which might be causing excessive packet load:

  • [2025] & [2026]: Your limiter which we talked about
  • [1781]: Tracking player position by default after entering arena. I would expect this to be making the bot send/receive lots of excess packets. However, I'm not sure if we're still doing this as it's an old commit. I would rather stop tracking players by default as we're now hitting packet thresholds. (Just to be on the safe side.)

I hope we resolve this as soon as possible as I fear TWD and TWL are suffering under this problem aswell.

01/02/08 01:11:54 changed by Maverick

  • owner changed.
  • priority changed from normal to high.
  • component changed from Bots - Other to Core.
  • severity changed from minor to critical.
  • summary changed from RoboBoy - purepubbot - getting kicked to purepubbot and matchbot getting kicked.

01/02/08 06:18:59 changed by dugwyler

Don't have time to reply to the whole thing at the moment, but here is the MERVBot handling of an obj toggle:

clientMessage *generateObjectToggle (Uint16 player, objectInfo *objects, Uint16 num_objects) { clientMessage *ret = new clientMessage(4 + 2 * num_objects);

if (ret == NULL) return NULL; char *msg = ret→msg;

/* Field Length Description

0 1 Type byte: 0x0a 1 2 Player ident (-1 = all players) 3 1 LVZ type byte: 0x35

The following are repeated until the end of the message

4 2 LVZ bitfield

*/

msg[0] = 0x0a; *(Uint16*)&msg[1] = player; msg[3] = 0x35;

memcpy(msg + 4, objects, 2 * num_objects);

return ret;

}

01/02/08 06:43:05 changed by Maverick

 clientMessage *generateObjectToggle (Uint16 player, objectInfo *objects, Uint16 num_objects)
 {	clientMessage *ret = new clientMessage(4 + 2 * num_objects);
 	if (ret == NULL) return NULL;
 	char *msg = ret->msg;
 
  /*Field Length  Description
    0     1       Type byte: 0x0a
    1     2       Player ident (-1 = all players)
    3     1       LVZ type byte: 0x35
  The following are repeated until the end of the message
    4     2       LVZ bitfield
 	*/
 
 	msg[0] = 0x0a;
 	*(Uint16*)&msg[1] = player;
 	msg[3] = 0x35;
 
 	memcpy(msg + 4, objects, 2 * num_objects);
 
 	return ret;
 }

* :-P

01/02/08 10:02:33 changed by dugwyler

Yeah, sorry, was rushing off and realized only after that I hadn't formatted it.

I've pretty much got object toggling using the C2S 0x09 packet implemented (with help from MERV and Hybrid). The size of one of these packets is 4 + 2*#objs, whereas a *objset is 16 + (min 6 for single-digit obj#, up to 14 for 5 digits)*#objs. For an individual message-sent *objon you're looking at 20 bytes minimum, up to 28, rather than 6 regardless of the ID when sent with this packet. Should be a good way to save on some unnecessary net traffic. Adapting the present implementation may be somewhat annoying, however, in attempting to properly use the new way of toggling. There has to be a balance between toggling the messages quickly and trying to package as many as possible into a single packet. For DistensionBot in particular I might write a separate handler.

One thing I also noticed in MERV's clientprot.cpp is the handling of C2S 0x1D — a protocol-based "change settings" packet. Handling this could be the solution for Engima's problem of sending settings changes as text messages. I'm sure the server more tolerant of this method, as the changes are all sent in one packet. It's a shame this stuff isn't in D1st0rt's packet list, as that's basically the most complete non-coded packet list I've found yet. But the ominous "Unknown" makes it seem like these actually are not known… Guess he just hasn't had the time to update it. (Other unknown with implementation: 0x21 is to generate a soccer goal) Unfortunately it looks like there are no unknown packets that can be used to send prizes.

Regarding your post:

Yeah, looks as though from the DCs we're seeing, local and remote bots are both subject to the same limits.

Not sure about the server error and bots/sysop. I think I was able to at one point get it to notice a packet flood from King's IP, but more often than not when I had it prize me in excess, it would say that I was doing the packet flooding — probably from a large amount of returned ACK messages. Do we log sysop warning messages anywhere? If not, should we be doing so? One of the bots used to keep a zipped log archive going, but I'm not sure if that's continued anymore. At the very least keeping an eye on some of the sysop warning msgs could be helpful.

The reason I think that too many PMs (or any kind of message) cause the disconnection/recycling problem is due to repeatability. I had !msgbeta PMing about 300 players at once, and it crashed the server several times in a row, instantly after I issued the command. Also a problem was having 30-40 people in the arena and then in the ArenaJoined event PMing everyone in the arena with 3 separate PMs. This one reproduced a couple of times as well before it was fixed. You'll notice that now, the server isn't recycling, but instead the bot is getting DC'd. I think this is a result of the server being able to get in the DC request before it is forced to recycle — something that wasn't possible before limiting packets (or controlling 'spams' … DistensionBot queues up various command text spams rather than sending all at once).

As for the limit unnecessarily slowing down bots, I don't think this is the case. DistensionBot was able to run with absolutely no lag with a packet cap of 40. (That was when the server lag wasn't bad.) And it easily sends out more messages than any other bot, hands down. Keep in mind a cap of 40 means 40 messages every 75ms, or approximately 533 messages a second. If there is a bot that's sending out more than 500 chat/PM/prize messages in a particular second, it seems prudent to throttle it a bit until it gets over that slump. I don't think it's really slowing anything down that shouldn't be. (Also keep in mind that only DistensionBot has that cap set in that particular way right now. All other bots presently have a cap of 100 per 75ms.)

UPDATE: I just looked over the limiting code, and it's actually set up so that it will never cap anything! An oversight, getting two variable names switched. Whoops. That means a lot of what is being seen may just be lag, or more precisely, what we were talking about yesterday in the park: the bot sending out too many messages, players lagging and not receiving the messages, the bot resending the message creating twice as much traffic and also not using packet clustering for each unsent message — imagine a spike and 50 players need 5 PMs each resent to them … 250 single message packets, which would be quite enough to exceed the limit and cause a DC…

Also: the player position updating uses a minuscule amount of bandwidth. At most we are looking at 3 bytes C2S per switch of a player. Also, regardless of the person the bot is spec'ing, it's going to receive PlayerPosition packets whether it wants them or not. The "update" is just code for "changing spec target." And by default in setup.cfg, the change of targets (at least on King) is done once every 5 seconds. This is pretty inconsequential, and for the staggering amount of functionality lost versus the very little required to do it I don't think it's worth it at all. If anything, I would suggest we just opt to include it in the packet cluster (currently it's not for some reason). This will ensure that it has little chance of raising the packet count. I'll include this in the next commit of the GPG along with the packet limiter fix.

01/03/08 06:55:42 changed by Maverick

We should keep in mind that the packet 0x07 is used for more things then only kicking a client when it's using too much packets (you might already know this).

Today when I was locally testing, I started TWCore with a hub name that wasn't on any staff files (AllowVIEClients=1 and no sysop password login). It logged on successfully but would get kicked after a few seconds:

=== Loading bots ...   ===
03 jan 2008 15:48:30   Hub (HubBot) is logging in ...
03 jan 2008 15:48:30   Hub: Successful password packet response
03 jan 2008 15:48:30   Hub (HubBot) logged in: 63 ms.
03 jan 2008 15:48:41   Hub (HubBot) is disconnecting: received bi-directional packet 0x07 from server (ordered to DC)
03 jan 2008 15:48:41   Hub (HubBot) disconnected gracefully.

After specifying a correct sysop password and putting the name on sysop.txt, the bot was able to stay on.

Just something I noticed today regarding the 0x07 packet :)

01/04/08 13:28:38 changed by Maverick

I found the reason why bot's are getting the 0x07 packet but I don't know how to read it. I checked the server (subgame.log) and matched it with the time when bots get the 0x07 packet:

TWCore console:

04 Jan 2008 13:28:00   RoboBot4 (multibot) is disconnecting: received bi-directional packet 0x07 from server (ordered to DC)

subgame.log

Fri Jan 04 13:28:00:  Reliable packet lost(34) to: RoboBot4 (pending=36029 current=36284 diff=255)

Another example:

TWCore console:

04 Jan 2008 07:01:43   RoboBoy (purepubbot) is disconnecting: received bi-directional packet 0x07 from server (ordered to DC)

Subgame.log:

Fri Jan 04 07:01:43:  Reliable packet lost(14) to: RoboBoy (pending=14614 current=14869 diff=255)

Now.. what does it mean..

01/04/08 13:40:42 changed by Maverick

It seems that the bot doesn't properly send an ACK on a specific reliable message.
However, it does seem to suggest that the bots don't get a 0x07 packet if they are packet flooding.

01/04/08 13:44:50 changed by Maverick

Unfortunately there was no error of Distensionbot in the log as it probably didn't ran the past few days.

01/04/08 13:52:13 changed by Maverick

More information on this error message: http://forums.minegoboom.com/viewtopic.php?t=2953

hm might still be about packet flooding or at least about sending/receiving too much reliable packets.

01/04/08 15:38:43 changed by dugwyler

Nice findings … this is a big step forward on this issue. Looks like the lag may be a big part. Any more word from anyone in the know about what is going on?

I hope that the packet queuing (now actually enabled) will help curb this problem. I'll commit it soon along with the LVZ stuff. Just a lot to do presently.

You said that the bot doesn't properly send an ACK on a specific message — do you have more info? I took a look at the reliable message receiving system, but couldn't find too much that is odd. The only slightly questionable thing is sendAck(int) in the GPG:

    public void sendAck( int ackID ){
        ByteArray      bytearray = new ByteArray( 6 );

        bytearray.addByte( 0x00 );  // Type byte (specifies bidirectional)
        bytearray.addByte( 0x04 );  // Bidrectional: Reliable ACK
        bytearray.addLittleEndianInt( ackID );  // ID

        if( m_ssEncryption != null ){
            composePacket( bytearray );
        } else {
            composeImmediatePacket( bytearray, 6 );
        }
    }

The SS encryption class should not be null, so we're generally running composePacket. But, composePacket adds the specified packet to the queue of packets to be sent, and as a reliable message (requiring its own ACK). It might not happen extremely frequently, but if the ACK is the only packet in the queue, then we're sending a request to ACK our ACK… which is of course completely ridiculous, and just generates unnecessary net traffic.

Maybe in the case of just having a single packet in the packet queue, we should check the packet's first two bytes against the little endian short 0x0400 (header for ACK). (Is that right? It's been a while since CS undergrad … but if the first byte is 0x00 and the second is 0x04, that'd make them when read as a little-endian short 0x0400?) If it shows up, we'll send the packet unreliably. Sound OK?

01/04/08 16:25:35 changed by Maverick

The ACK part of mine was just a hunch considering the server reports he lost a reliable message to the bot (which I assume the server never got an ACK to it). However, that the server doesn't receive an ACK can be caused by many things, outlined in the topic on MGB forums I linked.

Yea sounds OK (had to read it a couple of times to understand).

btw, had a short talk with Mythrandir about this as I thought he would know something about this subject too:

C 2:MMaverick> myth, can you shed your light on this issue: http://www.twcore.org/ticket/185 ?
C 2:MMaverick> we need every help we can get with it
C 2:MMaverick> and dugwyler told me you encountered it before
C 2:Mythrandir> yeah and dug too
C 2:MMaverick> I just found out about the server error
C 2:MMaverick> makes a few things more clear but still very confusing
C 2:Mythrandir> I wish I knew but I don't know what exactly is causing it. it definitely happens more often when the bot is doing a lot of stuff
C 2:MMaverick> something to do with the reliable packets I think

I know, doesn't help much :-P

(follow-up: ↓ 15 ) 01/05/08 11:22:59 changed by dugwyler

Interesting findings with the ACK being sent reliably: from a brief test on my home core with a message printing out when it occurs, this is done about once a second(!) from every bot. Apparently ACKs make up a lot of the outgoing net traffic. I've switched to sending them unreliably, as a normal implementation of the SS protocol would, and it causes no problems. The only tricky piece is, if the ACKs are sent unreliably and are actually lost (very small though they are), the server will count the packet as lost and will resend it: this will count against the 255 max pending allowed before it kicks. However, since the server will not have to send an ACK back for every ACK, it will have more bandwidth to be able to send more packets that matter. Meanwhile if there is S2C packetloss that causes the "ACK of an ACK" from the server to get lost, it will not cause the bot to get slowed down in processing waiting for it. Presently, if reliable packets are received out of order nothing is processed until the ACK for the one that comes next is received … Basically what this means is you could have several commands build up while we're waiting for this totally unnecessary "ACK of an ACK". Once the ACK is finally received, the packets building up are processed all at once. If they were things that require a lot of messages sent out at once, then it could result in a flood of packets being sent out. Not that this would necessarily be bad … but it might generate events that would result in reliable packets being sent from the server to the bot, more ACKs of ACKs in repsonse — and so on!

After another test the LVZ stuff seems to be working great. I'll commit it along with this ACK handling change… hopefully the combination will help. Also the packet limiter is enabled now, but with a pretty high limit by default. May want to trim it down, or on bots that send out quite a bit elect for a manually set value. (Could also add this as a setup.cfg option.)

(in reply to: ↑ 14 ) 01/05/08 13:55:18 changed by Maverick

Like I already ?messaged you, good work!
I really hope this solves (most) of our problems. :D

Replying to dugwyler:

I've switched to sending them unreliably, as a normal implementation of the SS protocol would, and it causes no problems.

Does that mean TWCore made a mistake implementing the SS protocol ?

Just updated RoboKing, no problems so far (but that also didn't happen on the old version).

01/05/08 14:47:41 changed by Maverick

I had to shut down RoboQueen anyway so I upgraded RoboQueen at the same time, we'll see how it goes.

01/05/08 15:57:22 changed by Maverick

funny, RoboBot1 was disconnected using the 0x07 packet but I remembered that this was caused because RoboQueen spawned another multibot with the same name :)

01/05/08 16:16:29 changed by Maverick

Started RoboQueen om 17:47 it's now 19:17 and no problems so far.
Seems the fix seems to be doing (very) well (for now).

01/06/08 13:01:49 changed by Maverick

Until now still have had only once case of 0x07 disconnection:

06 Jan 2008 04:03:57    MatchBot2 (matchbot) is disconnecting: received bi-directional packet 0x07 from server (ordered to DC)
Sun Jan 06 04:03:57:  Reliable packet lost(53) to: MatchBot2 (pending=159086 current=159341 diff=255)

this is far far better from what we have had before.

Maybe the MatchBot's aren't using the new LVZ functionality?

01/06/08 16:53:27 changed by dugwyler

Everything by default should use the new functionality… as far as using it as efficiently as it could be used, though, depends mostly on whether Objset is used. If multiple objon or objoff BotAction operations are performed back to back, multiple packets are being sent when they could be wrapped as one packet that would require only 1 ACK as opposed to several. That's all I can think of there.

One disadvantage to the new code remains in that ACKs may get lost… if we have awful C2S packetloss but no S2C, the new code will actually not work as well as sending ACKs reliably. On the whole it should be much improved, though.

01/13/08 05:48:28 changed by Maverick

So far so good. Until now I've only seen TWDevChat2 and MatchBot6 disconnect with the 0x07 packet.

RoboBot2 and RoboBot3 also disconnected at the same time but this was after some NPE of the trivia module:

java.lang.NullPointerException
        at twcore.bots.multibot.trivia.trivia.getMinTimesUsed(trivia.java:524)
        at twcore.bots.multibot.trivia.trivia.grabQuestion(trivia.java:494)
        at twcore.bots.multibot.trivia.trivia$1.run(trivia.java:147)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
java.lang.NullPointerException
        at twcore.bots.multibot.trivia.trivia.grabQuestion(trivia.java:499)
        at twcore.bots.multibot.trivia.trivia$1.run(trivia.java:147)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
Exception in thread "Timer-35" java.lang.NullPointerException
        at java.util.StringTokenizer.<init>(StringTokenizer.java:182)
        at java.util.StringTokenizer.<init>(StringTokenizer.java:204)
        at twcore.bots.multibot.trivia.trivia.grabQuestion(trivia.java:514)
        at twcore.bots.multibot.trivia.trivia$1.run(trivia.java:147)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
java.lang.NullPointerException
        at twcore.bots.multibot.trivia.trivia.getMinTimesUsed(trivia.java:524)
        at twcore.bots.multibot.trivia.trivia.grabQuestion(trivia.java:494)
        at twcore.bots.multibot.trivia.trivia$1.run(trivia.java:147)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
java.lang.NullPointerException
        at twcore.bots.multibot.trivia.trivia.grabQuestion(trivia.java:499)
        at twcore.bots.multibot.trivia.trivia$1.run(trivia.java:147)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
Exception in thread "Timer-36" java.lang.NullPointerException
        at java.util.StringTokenizer.<init>(StringTokenizer.java:182)
        at java.util.StringTokenizer.<init>(StringTokenizer.java:204)
        at twcore.bots.multibot.trivia.trivia.grabQuestion(trivia.java:514)
        at twcore.bots.multibot.trivia.trivia$1.run(trivia.java:147)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
10 Jan 2008 21:31:03   RoboBot2 (multibot) is disconnecting: received bi-directi
onal packet 0x07 from server (ordered to DC)
10 Jan 2008 21:31:03   RoboBot2 (multibot) disconnected gracefully.
10 Jan 2008 21:32:13   RoboBot3 (multibot) is disconnecting: received bi-directi
onal packet 0x07 from server (ordered to DC)
10 Jan 2008 21:32:13   RoboBot3 (multibot) disconnected gracefully.

This NPE I fixed in changeset:2136.

I think Matchbot can be fixed by compacting the LVZ usage for the scoreboard (using the new LVZ methods).

Overall the bots have become way more stable as far as I can see.

01/13/08 06:11:52 changed by Maverick

Hm this looks like just a glitch in the connection:

13 Jan 2008 08:06:37   Mr. Arrogant 2 (pubarrogant) is disconnecting: received bi-directional packet 0x07 from server (ordered to DC)
13 Jan 2008 08:06:37   Mr. Arrogant 2 (pubarrogant) disconnected gracefully.
Sun Jan 13 09:02:23:  Reliable packet lost(19) to: Mr. Arrogant 2 (pending=312435 current=312690 diff=255)
Sun Jan 13 09:02:25:  Reliable packet lost(19) to: Bajki (pending=3228 current=3483 diff=255)
Sun Jan 13 09:02:27:  Reliable packet lost(19) to: Swaybar (pending=1968 current=2223 diff=255)
Sun Jan 13 09:02:28:  Reliable packet lost(19) to: WAKiziRU (pending=1479 current=1734 diff=255)

When checking the log I found out alot of normal players also sometimes have the "Reliable packet lost" error, like above. This may just be caused by lag the server was having.

04/30/08 16:08:49 changed by milosh

  • summary changed from purepubbot and matchbot getting kicked to The 0x07 Packet (Ordered to DC).

Changing name of ticket.

07/31/08 07:19:10 changed by milosh

  • owner set to nobody.

Going for consistency.


Add/Change #185 (The 0x07 Packet (Ordered to DC))




Change Properties
Action