Connection issues encountered with CTCE devices #369
Sorry, my file names in the zip did not match my commentary. I have corrected that in this file. For reference: JS05 = hub, JWS1 = peer1 and JS02 = peer2. |
I cannot explain what is going on or why. Peter Jansen (@Peter-J-Jansen) is the person to help you. He wrote our CTCE support and knows it inside and out. What I will say is this:
Hercules CTCE documentation clearly states:
Complaining that "ports below 1024 should be allowed!" does not somehow magically change Hercules's existing already written code to somehow magically allow it. Whether you like it or not (whether you agree with it or not), you must nonetheless abide by the rules (constraints) that CTCE imposes upon you. I would personally try fixing your port number problem first before reporting any type of Hercules problem. Valid or not, Hercules does currently require CTCE port numbers to be >= 1024, so regardless of whether you like that or not (regardless of whether you agree with that or not), you must nevertheless abide by it, or things are obviously not going to work correctly. Fix your port problem and try again. THEN if things still do not work correctly, then you have a valid problem to report. Otherwise your "problem" report IMHO is invalid. But I will let Peter decide whether to accept your problem report (not recommended), or whether to reject and close it as "User Error" (recommended). Peter? |
Fish, If you'll read the whole report, you'll see that the port number issue was only one of 3. In the other two issues, I used a port number above 1024 in the configuration and the port number played no part in the major problem, i.e. issue 2 - port numbers moving from one device to another. Peter, I had not noticed the port limitation in the documentation, and while I may not agree with it, I will obviously abide by it. Please disregard issue 1 in my report and look at issues 2 and 3. Thanks! |
Hi Jeff, Glad to see the Hercules CTCE's are put to good use! Concerning the issues you encountered:
    600 CTCE $(IMAGE)600 530=192.168.1.32 05001 ATTNDELAY 200 # VTAM link to JS05 (SNA to OS/390 2.10)
    601 CTCE $(IMAGE)601 531=192.168.1.32 05002 ATTNDELAY 200 # VTAM link to JS05

Should that not be, noting 05000 and 05001:

    600 CTCE $(IMAGE)600 530=192.168.1.32 05000 ATTNDELAY 200 # VTAM link to JS05 (SNA to OS/390 2.10)
    601 CTCE $(IMAGE)601 531=192.168.1.32 05001 ATTNDELAY 200 # VTAM link to JS05

This could explain the errors you described under item 2. As you explained that changing to unique device addresses (devnums) fixed your problem, did you perhaps also correct this "rport" number error?
When an incoming CTCE connection is received by the CTCE listener, a decision must be made whether it matches a connection waiting to be connected to. The preferred method makes that decision using the remote IP address ("raddress") of the incoming connection attempt, combined with the remote device number ("rdevnum"). (Alternatively, the older method uses the remote port number ("rport") when "rdevnum" is not specified. It is best to stop using that method by always specifying "rdevnum", or at least the equal sign (=) before the "raddress".) A feature which I nearly always use is the exclusive-or operation on "rdevnum" so that each Hercules side uses the even devnum addresses for reading, and the odd ones for writing (or the other way around). The resulting CTCE configuration could then for example be:
When using connections between different hosts and thus different IP addresses, the port numbers ("lport" and "rport") can be omitted completely. Same host Hercules instances though will need it. Jeff, I hope this helps. I would like to close this issue, but will await your OK to do so. Thanks, and let's stay healthy! Cheers, Peter |
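As a small illustration of the exclusive-or pairing Peter mentions, the arithmetic can be sketched in plain Python. This is not Hercules configuration syntax, and the helper name `xor_partner` is made up for this example; it only shows how flipping the low bit pairs an even devnum with the next odd one and back again.

```python
def xor_partner(devnum: int, mask: int = 1) -> int:
    """Return the device number paired with devnum by flipping the masked bit(s)."""
    return devnum ^ mask


# Device numbers 530 and 531 (hex), as used in this thread.
for local in (0x530, 0x531):
    print(f"local {local:04X} <-> remote {xor_partner(local):04X}")
```

Applying the operation twice returns the original device number, which is why both sides can use the same rule and still end up with a consistent read/write pairing.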
Peter said:
Regardless of whether there was a user configuration error or not, Hercules should NOT be crashing. It looks to me like you have a locking problem somewhere:

    21:40:29 HHC05076I 0:0530 CTCE: Connection closed; 0 MB received in 0 packets from 0:0600=192.168.1.32:2000/64011; shutdown=0
    21:40:29 HHC05086I 0:0530 CTCE: Recovery is about to issue Hercules command: DEVINIT 0:0530
    21:40:29 HHC05076I 0:0530 CTCE: Connection closed; 0 MB received in 74 packets from 0:0600=192.168.1.32:2000/64011; shutdown=0
    21:40:29 HHC05086I 0:0530 CTCE: Recovery is about to issue Hercules command: DEVINIT 0:0530
    21:40:29 HHC05081I 0:0530 CTCE: Already awaiting connection : 5000 <- 0:0600=192.168.1.32:1600/*
    21:40:29 HHC02245I 0:0530 device initialized
    21:40:29 HHC05081I 0:0530 CTCE: Already awaiting connection : 5000 <- 0:0600=192.168.1.32:1600/*
    21:40:29 HHC90013E 'initialize lock(&dev->ctceEventLock)' failed: rc=17: already init'ed; tid=00008808, loc=ctcadpt.c:3530
    21:40:29 HHC00007I Previous message from function 'loglock' at hthreads.c(104)
    21:40:29 HHC90028I lock &dev->ctceEventLock was already initialized at ctcadpt.c:3530
    21:40:29 HHC02245I 0:0530 device initialized
    21:40:29 HHC05054I 0:0530 CTCE: Started outbound connection :64488 -> 0:0600=192.168.1.32:1600
    21:40:29 HHC05054I 0:0530 CTCE: Renewed outbound connection :64488 -> 0:0600=192.168.1.32:1600
    [...]
    21:42:57 HHC01603I locks held sort tid
    21:42:57 HHC90017I Lock 00000000017b4350 (&sysblk.ioqlock) created by 00008a4c (panel_display) on 21:24:08.982615 at impl.c:862
    21:42:57 HHC90029I Lock 00000000017b4350 (&sysblk.ioqlock) obtained by 00000eec (idle dev thrd) on 21:41:44.644110 at channel.c:2473
    21:42:57 HHC90017I Lock 0000000001819f90 (&logger_lock) created by 00008a4c (panel_display) on 21:24:08.982848 at logger.c:484
    21:42:57 HHC90029I Lock 0000000001819f90 (&logger_lock) obtained by 000048f8 (logger_thread) on 21:42:57.085613 at logger.c:383
    21:42:57 HHC90017I Lock 0000000004f05038 (&dev->lock 0:0531) created by 00008a4c (panel_display) on 21:24:09.056089 at config.c:657
    21:42:57 HHC90029I Lock 0000000004f05038 (&dev->lock 0:0531) obtained by 00006f44 (CTCE 0531 RecvT) on 21:41:38.807221 at ctcadpt.c:2719
    21:42:57 HHC90017I Lock 000000000177da08 (&cckdblk.gclock) created by 00008a4c (panel_display) on 21:24:09.337180 at cckddasd.c:60
    21:42:57 HHC90029I Lock 000000000177da08 (&cckdblk.gclock) obtained by 00009084 (cckd_gcol) on 21:41:42.463992 at null:0
    21:42:57 HHC01603I threads waiting sort tid
    21:42:57 HHC90023W Thread Processor CP00 tid=0000788c waiting since 21:41:38.898873 for lock &dev->lock 0:0531 = 0000000004f05038
    21:42:57 HHC00007I Previous message from function 'threads_cmd' at hthreads.c(1614)

I just wanted to point out that regardless of the cause, Hercules should never be crashing.

Jeff said:
I apologize for the misunderstanding, Jeff. But the issue is now moot: Peter has identified the cause for your problems, just as I knew he would. Take care both of you. (Jeff? Please close this issue whenever you feel comfortable to do so, OR let us know why you feel it should remain open. Thanks.) |
Hi Peter,
Yep, I'm getting a lot of use out of it. Thanks for creating it!
No problem. Hey, it's your tool, so your rules. :) It just ruined a great port numbering scheme I had going... :(
Good catch! You are correct, I must have introduced a typo in my last day's testing.
I did correct it without even realizing it when I switched to unique device numbers. Unfortunately, once I switched the device numbers back using the corrected port numbers, I still get the same errors. Using shortened configuration files and starting the images with no "guest" OS IPLs, I still see the connection crossovers. (Note: the configurations and full log files are in the attached zip file at the end.)
Results in JS05:
Results in JWS1:
Results in JS02:
And, like before, I encountered shutdown problems due to the crossovers.
Yeah, that's what I figured. I mostly included this just for awareness and so you'd know there was a dump, in case it might prove useful.
In looking at your suggested configuration, it doesn't use the correct remote device numbers for JWS1 or JS02. I set it up and when I tried it, none of the instances even seemed aware of each other. I've since tried a couple of variations on it, trying to get it to work with non-matching local and remote devnums, without success. The remote device number does not increment for me. Could you provide an example that uses local device numbers of 530-531 and remote device numbers of 600-601? Once I get that going, I should be able to extrapolate my other links from it. Thanks!

Hi Fish, Unfortunately, I'm still experiencing the problem, so I'm not ready to close this issue yet. Please leave it open for now. Thanks! |
No problem! You are, as you said, still experiencing problems, so we will label this issue as a "Bug" and keep it open until it is resolved. |
Please compress the dump and FTP upload it to my "incoming"
and then send me an email letting me know when you have done so. I will then download it and analyze it to try and determine what went wrong. Thanks!
In fact, you will not be able to even list the contents of the folder (i.e. you will not be allowed to do a "dir" or "ls" of the folder's contents) since that necessarily requires read-access, and as just explained, the folder does not have read-access, only write access, so you can be assured whatever you upload to there is 100% secure. Only I will be able to download from that folder since only I am the administrator. |
Jeff and Fish, Thanks for your follow-up comments. Jeff certainly uncovered what I consider 2 bugs :
The above omits the JS05 lport specifications in favor of using the 3088 default, as well as the JWS1 and JS02 rports (thus also defaulting to 3088), but that is not important; they can still be specified as well. Also, I've used "localhost" instead of actual IP addresses, but that is immaterial as well. The only important thing is leaving out the remote devnum specifications. I think I should correct this by ensuring that a remote CTCE incoming connection always includes a matching remote port number, whether specified or defaulted (3088), and never relies on a matching remote devnum alone. That should actually be simpler to fix than the 1st bug. Fish wrote about another possible issue :
That might be correct, although I think that in Jeff's case it's caused by the 2nd bug. At least with the circumvention for that bug as shown above, I could no longer reproduce it. All 3 Hercules instances closed down correctly, in whatever order I tried that. That being said, I do recall that during CTCE development, I did encounter sometimes CTCE lock issues as a result of either manual Cheers, Peter |
Hi Peter, Regarding the issue of trying to properly match up identically numbered devnums with multiple Hercules instances on the same host, I have a different suggestion. What if you disassociated the devnum value from the actual physical devnum and instead treated it as a symbolic identifier? This means instead of specifying 530=localhost, specify something unique, like HOST05=localhost, or in Jeff's case, JS05530=localhost. In other words, use the value as a name to provide uniqueness rather than tying it to an actual device number (which may not be unique with three or more Hercules instances on the same host). The chosen name could be anything the user wants that helps him identify the connection to both sides. It could also be a CDRM name, or NJE node name, or whatever helps to identify which connection is intended to go where. This of course means the other side's configuration must have that same name coded as well. This would most probably mean you would need an additional configuration parameter on the CTCE statement. This method is exactly what the TCPNJE 2703 device uses. Its parameter specifications include RNODE= and LNODE= values. Most of us use the actual NJE node names on each end for these values, because they self document. But it is not required to be the NJE node names. It is perfectly ok to code RNODE=A and LNODE=B while the actual NJE node names used on that connection are something else entirely. TCPNJE uses the A and B values only to associate the right connection, and they have no bearing on the actual data traffic that will flow. By disassociating the devnum from the actual devnum, users could still code the devnum value if they wish - it is now just a symbolic name. But in cases where further uniqueness is required, this offers a way to specify it. Perhaps something like this could be used to resolve the duplicative devnum problem? Regards, |
Hi Bob, Thanks for your suggestion on how to differentiate multiple Hercules instances on the same host. Originally, in the first CTCE implementation (which I've been referring to as CTCE v1), the only method was based on IP address and CTCE listening port number. With many CTCE connections, managing those listening port numbers becomes cumbersome, hence in the second implementation (a.k.a. CTCE v2) these port numbers can be replaced with the devnums. This comes much more naturally to those of us configuring the OSes using those devnums. But, as I specifically wanted to continue supporting the CTCE v1 approach, the port number identification is still working and supported. So, effectively, there still is the possibility to use that method, and my tests, as explained in my comments to Jeff, do work; I tested them. OK, identification using port numbers instead of, say, NJE node names is indeed cumbersome, and needs to be carefully specified, as each port number on a given host must be unique. But it does work. The problem as experienced by Jeff is that, if one uses that differentiating / identification port number method, currently the devnums must be omitted completely. That's the workaround I tested and provided to Jeff. The fix for that problem I have already coded, and I will be testing it over the next few days. The beauty of that fix is that it is trivially simple. Whether I should provide another, additional differentiating / identification method, e.g. using additional parameters like your suggested RNODE= and LNODE=, specifying NJE node names which would need to be unique, I am a bit worried about, given the complications and the effort required to test it all, as well as the continued support for the current port number method. And all that because of Hercules instances running on the same PC (Windows or Linux or macOS). I'd rather not add that complexity, but yes, I admit, I'm a bit lazy.
But the fix so that the devnums can be left in, so that Jeff's original configuration will work, I believe is an easy thing for me to do. But please feel free to contradict me Bob! Cheers, Peter |
Were you able to reproduce it without the circumvention? (i.e. without your fix? i.e. with stock v4.3?) I am unable to try doing so myself due to not having VM/SP 5 (and/or whatever other guest operating systems are involved). If I could reproduce it on my own then I could properly look into it. Since I can't, I cannot.
10-4. Please keep me informed of your progress and PLEASE let me know if you need any help. As I said earlier, no matter what "goes wrong" in Hercules, it should not ever crash! (or hang, etc...) Thanks. p.s. No rush! |
Hi Fish, If you look back at my most recent prior post, you'll see that the stripped down configurations I used to recreate the issue have no DASD in them and no OS IPLs were performed. You should be able to recreate the issue with those configs. |
Hi Peter, Thanks! Based on your suggested configuration changes, I have everything up with duplicate devnums and no issues. I've IPLed all the various affected OSes and confirmed that communication works across all the links. I used the configuration as written (i.e. leaving out the lports in JS05). Once that worked, I recreated my desired 6 instance configuration and tested that using the same format. The only deviation I made was to eliminate localhost. I've been burned too many times by Windows resolving that to an IPv6 address. I brought everything up and after changing the OS configurations back to the original, duplicate device numbers, everything is up and working great. Thanks for your help! Please let me know if I can help test anything as you work through solutions for the issues we discovered. |
For your most recent prior post, yes, that is true, but it was only in your original report that the watchdog thread on JS05 detected that Processor CP00 was hung (and automatically created a crash dump as a result), and in that particular specific instance, a guest operating system was indeed IPL'ed (VM/SP 5 from device 1C0 in this particular case). Neither of the other two systems was IPL'ed, true, but system JS05 certainly was, and since that is where the problem is that I am wanting to research (hung processor), I thus need either a copy of VM/SP 5 to be able to IPL on JS05, or else some other guest operating system in order to be able to recreate your hang (but I would feel more comfortable using the same operating system that you were using that caused the problem in the first place).
I don't think so. Looking at the configs and logs from your most recent prior post's attached file, none of the systems appear to have been IPLed. The only way to recreate your original Processor CP00 hang (and resulting crash) as reported in your original post, is to use the exact same configs as provided in your original report, and to IPL the same guest that you did: VM/SP 5. (Or, as explained, some other guest operating system that, when IPLed, is also able to recreate the problem/hang.) |
I too make it a habit of never using "localhost". Instead, I always use "127.0.0.1", which is essentially the exact same thing. |
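For readers wondering why a literal address avoids the "localhost" surprise mentioned above, here is a minimal Python sketch using only the standard `socket` module. The helper name `resolved_families` and the port 3088 (the CTCE default mentioned in this thread) are choices made for this example. A literal IPv4 address can only yield IPv4 results, whereas a host name is subject to whatever the local resolver returns.

```python
import socket


def resolved_families(host: str, port: int = 3088) -> set:
    """Return the set of address families that `host` resolves to on this machine."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); keep the family only.
    return {info[0] for info in infos}


# A literal IPv4 address needs no name lookup and is always AF_INET.
print(resolved_families("127.0.0.1"))
```

Calling `resolved_families("localhost")` on your own machine shows what the name resolves to there; the result depends on the operating system and hosts file, so no particular output is claimed here.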
Hi Fish,
OK. I wasn't clear on what you were trying to recreate. Hercules hangs in both cases, but since there was no IPL to activate a CP in the second case, there is no CP hang. I thought you were just trying to recreate the hang. To be honest, I don't know if it would eventually crash without the CP being hung, since there's no active CP for the watchdog thread to monitor. I've never waited more than a minute or two after it hangs, I just use Windows' Task Manager to end the task. |
Hi Peter, Fish's comment about using 127.0.0.1 spurred a thought. Another possible resolution for this would be to allow the user to configure the local IP address. Since all the IPs in the 127.0.0.0 range refer to the local host, the user could assign unique "localhost addresses" to each instance and achieve the unique lookup value that way. For example, I could use 127.0.0.1 for JWS1, 127.0.0.2 for JS02 and 127.0.0.5 for JS05 in my configurations and the CTCE code would see them all as unique host/devnums. |
Hi Jeff, Thanks for your positive feedback re. my workaround. Yes, that would work, but I do not know how to ensure that packets to 127.0.0.1 / .2 / .5 would be correctly routed / delivered to the correct Hercules instances / processes, or put differently, how to establish these localhost addresses to ensure that. And to top it off, how to ensure the same technique could work for all Hercules' supported platforms, Windows, Linux, and MacOS. One approach which I think could be made to work, at least under Linux and MacOS (up to MacOS version 10.x.y, but not 11.x.y), is establishing additional TAP interfaces, bridging these together (along with the hardware NIC) under a master, giving each TAP interface its own unique address in the same LAN as the NIC, and then configuring the Hercules instances with those addresses accordingly. But whether this configuration overhead is less effort than managing unique lport addresses (after my upcoming patch for it that is), is, I think, questionable. As soon as my upcoming patch is available, your very initial configuration with all remote devnums specified should also work, so my workaround should then no longer be needed. I propose to keep this issue open until we've been able to confirm that. Cheers, Peter |
Fish wrote :
An interesting workaround I saw by Rob Prins in turnkey-mvs@groups.io to ensure "localhost" is always the IPv4 127.0.0.1, is to add an entry for that in the etc/hosts file (thanks Rob !) :
Cheers, Peter |
(paraphrased):
Yes, that would work too. But why go to that trouble when IMO it's easier to simply use 127.0.0.1 in your Herc config file instead? Bottom line: you can either: a) update your etc/hosts file leaving your existing Herc config file alone, or b) simply change your Herc config file instead. Either way, you still need to change something, and IMO it's easier to simply always use 127.0.0.1 in your Herc config file instead. Six of one, half a dozen of the other. <shrug> |
Well, I'm marginally interested in that too, but I believe Peter probably has that well in hand. What I'm mostly interested in right now is the original Processor CP00 hang.
Correct: the crash would not occur since no processors were hung. Now, if a deadlock was detected, then that would certainly cause a crash. But since no deadlock was reported we know that's not the cause for the hang, and as I said it is the Processor CP00 hang that I'm mostly interested in at this point. When I get a chance (I keep getting distracted (torn away) back and forth between several different things I'm looking into) I'll try to see if I can recreate the Processor hang using a different guest operating system (such as VM/370 SixPack maybe). If I discover anything I'll let you know. (Oh yeah! That original crash dump you sent me? It was a bust. It unfortunately told me nothing. That's why I need to fall back to Plan B: try recreating the hang/crash for myself) |
Peter,
There is no additional configuration required. The packets will be delivered to the existing TCPIP stack, just as if 127.0.0.1 had been used, with the receiving Hercules instance being controlled by the destination port number. For example, with no changes to my PC, I can ping 127.0.0.1, 127.0.0.2, 127.0.0.3, etc...You can just think of all the other 127.0.0.x addresses as aliases for 127.0.0.1. The advantage is that when the packet arrives, it has a different "label" on it, the unique IP address. Fish, Thanks for the explanation. I was pretty sure that dump wouldn't be useful for the original issue, but I'm disappointed it wasn't helpful on the CP hang. Let me know if there's any way I can help. |
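The point that every 127.0.0.x address belongs to the loopback range can be checked with Python's standard `ipaddress` module. The sketch below is only an illustration: the instance-to-address mapping is the hypothetical one Jeff proposed, and it says nothing about whether a given operating system accepts connections on addresses other than 127.0.0.1 without extra setup, which is platform dependent and not verified here.

```python
import ipaddress

# The whole 127.0.0.0/8 block is reserved for loopback.
LOOPBACK_NET = ipaddress.ip_network("127.0.0.0/8")

# Hypothetical per-instance addresses, taken from the proposal in this thread.
instances = {"JWS1": "127.0.0.1", "JS02": "127.0.0.2", "JS05": "127.0.0.5"}

for name, text in instances.items():
    addr = ipaddress.ip_address(text)
    # Each address is distinct, yet all of them fall inside the loopback block.
    print(name, addr, addr.is_loopback, addr in LOOPBACK_NET)
```

Because the three addresses differ while all remaining local, they could in principle serve as the distinct "label" Jeff describes.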
The 2 problems identified in Issue #369 for CTCE configurations are now corrected : 1. The "rport" parameter is now always taken into account when matching incoming CTCE connections, and no longer ignored when "rdevnum" is specified. 2. The "rdevnum" when specified as "=" was not incremented correctly when the "ldevnum.n" format was used to specify "n" multiple CTCE devices.
I have just committed the fixes for the 2 issues that were identified. As a result, Jeff's original configuration will now work as well, as my test confirmed. The second problem identified with the non-incrementing remote devnums when specifying multiple CTCE devices using a single config entry is now also fixed. I successfully tested also these configurations :
CTCE port number specifications are only needed for multiple Hercules instances on the same PC, but one of them (in my examples JS05) can just use the default 3088 (if that port number isn't used for something else, that is). As the device numbers on a given Hercules instance have to be unique, a single lport per instance is sufficient, which is a wee bit more efficient than an lport per CTCE device. The devlist output from the 3 log files shows (noting that the first "3088" is not a port number, but the CTC device type 3088) :
In case anyone wonders why some CTCE connections are shown with a "<=>" and others with "<->", well, the "<=>" sides of CTCE connections are contention winner sides, the "<->" are the contention loser sides. If we're all pleased with this then I propose to close this issue. OK? Cheers, Peter |
I agree if Jeff agrees. I can work on my Processor CP00 hang issue offline at my own leisure, and simply add a new additional comment if/whenever I have something to report. There's no need to keep it open for my sake. Is that fine with you, Jeff? |
Hi Peter, Thanks for the quick work! I'll get it tested and let you know about closing the issue ASAP. |
Hi everybody. I was able to get Hercules rebuilt and tested everything. It all looks good. Peter, Fish, |
OK, thanks for the positive feedback. I'll close #369 now. Cheers, Peter |
Hi,
I'd like to report the following issues when using the CTCE driver to emulate CTCs. Please note:
In the following, the Hercules instance referred to as JS05 is running OS/390 2.10, the two "peer" instances (JWS1 and JS02) are running VM/SP5.
The attached "ports" configuration and log files show the following.
With this configuration (Note: this is the complete configuration for this issue):
The following errors are generated:
Please don't get sidetracked by the leading zero in the port number, this does not cause any impact.
While I know ports below 1024 require the user to be authorized, they are still valid port numbers and should be allowed. If the user tries to use a reserved port while unauthorized, an appropriate error could be generated at that time. Since I have to run authorized to support my networking requirements, those ports should be available to me.
The complete configurations are in the attached zip file, but the significant portions are:
JS05 (the hub):
JWS1 (peer 1):
JS02 (peer 2):
As you can see, both JWS1 and JS02 have their CTCs connected to addresses 600-601. When bringing up the JS05 and JWS1 instances of Hercules by themselves, no problems were encountered. The OSes were IPLed and communication established over the link. A devlist of the affected devices shows good connections:
As soon as the Hercules instance for JS02 was brought up, the CTC connections between JS05 and JWS1 were "renewed", causing them to fail:
The resulting devlist shows the device entries for JS05 devices 530 and 531 have been overwritten with new remote port numbers:
JS02's OS was never IPLed.
Soon after gathering the evidence for item (2) above, I attempted to "quit" Hercules instance JS02.
JS02:
This appeared to shut down properly, but had adverse effects on JS05 and JWS1:
JS05:
At this point, JS05 recorded a crash dump.
JWS1:
I was then able to shut down JWS1 normally:
My estimation of what might be going on:
By changing the OS configurations to use unique device addresses, I have been able to run for almost 24 hours with these 3 and several more images all interconnected. Instances have been brought up and down without affecting other instances (except their connection peers).
It looks like CTCE is using the remote device number, possibly in addition to the remote IP address as some kind of lookup value to keep track of multiple TCP sessions. In my case, since all the Hercules instances run on the same computer, the IP address does not vary, so multiple instances with the same device number result in the same lookup value.
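The suspected collision can be modeled with a few lines of Python. This is an illustrative model only, not the Hercules CTCE source code; the dictionary layout, the function names, and the port numbers 5000 and 5001 are assumptions made for this sketch. It shows how a lookup keyed on remote IP address and remote devnum alone cannot tell two same-host peers apart, while adding the remote port to the key can.

```python
# Two peers on the same host, both presenting remote devnum 0600.
# The rport values are made-up placeholders for this illustration.
incoming = [
    {"peer": "JWS1", "raddress": "192.168.1.32", "rdevnum": 0x600, "rport": 5000},
    {"peer": "JS02", "raddress": "192.168.1.32", "rdevnum": 0x600, "rport": 5001},
]


def key_without_port(conn):
    """Lookup value built from remote IP address and remote devnum only."""
    return (conn["raddress"], conn["rdevnum"])


def key_with_port(conn):
    """Lookup value that also includes the remote port number."""
    return (conn["raddress"], conn["rdevnum"], conn["rport"])


def build_table(conns, key_func):
    table = {}
    for conn in conns:
        # On equal keys the later connection silently replaces the earlier one.
        table[key_func(conn)] = conn["peer"]
    return table


collided = build_table(incoming, key_without_port)
distinct = build_table(incoming, key_with_port)
print(len(collided), list(collided.values()))
print(len(distinct), sorted(distinct.values()))
```

In this model the first table ends up with a single entry, mirroring the "crossover" symptom where one peer's connection displaces another's.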
If you have any questions or I can provide any additional information, please contact me at jsnyder1369 at google dot com.
Please note, the crash dump taken by JS05 is 611 MB and compresses down to 49 MB. GitHub will only allow attachments of 10 MB or less, so if you need access to the dump, please let me know and we can work out other means to get it to you.
Thanks for your help!
CTCE issues.zip