supervise not running

Discussion:

supervise not running - want down

Bastien Devos

2004-02-25 15:34:41 UTC

Hi,

I'm using djbdns on two servers, a master (Solaris 8) and a backup (Fedora Core 1
Linux).
On each server, I have an instance of dnscache running on the public Ethernet
interface. It queries the tinydns server (running on 127.0.0.1) for the
authoritative entries and forwards everything else to an external server.

DNS queries are working fine on client machines (Solaris, Linux, Windows), but I
have a question for which I couldn't find a satisfying answer either on Google or
in the mailing list archive.

I launched my DNS servers many weeks ago, and yesterday I ran svstat to check
that my processes were OK and to see their uptime.
On the master server (the Solaris one), I ran the following command:

master# svstat /service/dnscache
/service/dnscache: up (pid 9400) 523 seconds

This is of course not normal, so I ran the following to figure out what's wrong,
and got this first result:

master# while true; do svstat /service/dnscache; sleep 2; done
(...)
/service/dnscache: up (pid 9400) 630 seconds
/service/dnscache: up (pid 9400) 632 seconds
/service/dnscache: up (pid 9400) 634 seconds
/service/dnscache: up (pid 9400) 636 seconds
/service/dnscache: supervise not running
/service/dnscache: supervise not running
/service/dnscache: up (pid 9634) 2 seconds
/service/dnscache: up (pid 9634) 4 seconds
/service/dnscache: up (pid 9634) 6 seconds
/service/dnscache: up (pid 9634) 8 seconds
(...)

After a few hundred more seconds, I get a different error:

(...)
/service/dnscache: up (pid 9634) 622 seconds
/service/dnscache: up (pid 9634) 624 seconds
/service/dnscache: up (pid 9634) 626 seconds
/service/dnscache: up (pid 9634) 629 seconds, want down
/service/dnscache: up (pid 10363) 1 seconds
/service/dnscache: up (pid 10363) 3 seconds
/service/dnscache: up (pid 10363) 5 seconds
(...)

Later still, I get yet another variation:

(...)
/service/dnscache: up (pid 11197) 592 seconds
/service/dnscache: up (pid 11197) 594 seconds
/service/dnscache: supervise not running
/service/dnscache: supervise not running
/service/dnscache: down 0 seconds, normally up, want up
/service/dnscache: up (pid 11778) 2 seconds
/service/dnscache: up (pid 11778) 4 seconds
(...)

When I execute the same command on the backup server (the Linux box), I get this
output:

backup# svstat /service/dnscache/
/service/dnscache/: up (pid 19225) 3626400 seconds
backup# svstat /service/tinydns/
/service/tinydns/: up (pid 19228) 3626342 seconds

... no comment.

This problem is mostly transparent to the clients, but it's not very clean.
The error on the Solaris box seems to appear roughly every 600 to 700 seconds.
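
To measure the interval without reading the loop output by eye, the svstat lines can be filtered so that only pid changes are reported. This is a minimal sketch, assuming svstat prints the "up (pid N) S seconds" form shown above. report_restarts is a hypothetical helper name, not part of daemontools, and the function needs a POSIX shell (on Solaris 8 that likely means /usr/xpg4/bin/sh or ksh rather than /bin/sh).

```shell
#!/bin/sh
# report_restarts: read svstat output lines on stdin and print one line
# each time the reported pid differs from the previously seen pid.
# Hypothetical helper for illustration; not part of daemontools.
report_restarts() {
  last=""
  while read -r line; do
    case "$line" in
      *"up (pid "*)
        # Keep only the digits between "(pid " and the closing ")".
        pid=${line#*"(pid "}
        pid=${pid%%")"*}
        if [ -n "$last" ] && [ "$pid" != "$last" ]; then
          echo "restart: pid $last -> $pid"
        fi
        last=$pid
        ;;
    esac
  done
}
```

It could be fed from the same loop, for example: while true; do svstat /service/dnscache; sleep 2; done | report_restarts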

I guess this is related to Solaris and daemontools, but I don't have any idea of
what this could be ...

Any help would be much appreciated,

Thanks

Bastien.

Rob

2004-02-25 15:52:12 UTC

Hi,

One idea - You might want to do truss on the supervise process starting
at second 600 or so and see if it is receiving a signal or there is some
kind of error that causes it to choke.

Rob

Paul Jarc

2004-02-25 17:18:19 UTC

Post by Bastien Devos
Solaris

...

Post by Bastien Devos
/service/dnscache: supervise not running

<URL:http://marc.theaimsgroup.com/?l=djbdns&m=105234961406943&w=2>

paul

Bastien Devos

2004-02-26 09:21:34 UTC

Post by Rob
One idea - You might want to do truss on the supervise process starting
at second 600 or so and see if it is receiving a signal or there is some
kind of error that causes it to choke.

I did that, and here is the result:

master# truss -f -p 22939
22939: poll(0xFFBEFD38, 2, 1000020) (sleeping...)
22939: poll(0xFFBEFD38, 2, 1000020) = 1
22939: sigprocmask(SIG_BLOCK, 0xFFBEFC90, 0x00000000) = 0
22939: read(3, 0xFFBEFD17, 1) Err#11 EAGAIN
22939: waitid(P_ALL, 0, 0xFFBEFC30, WEXITED|WTRAPPED|WNOHANG) = 0
22939: read(6, " d", 1) = 1
22939: kill(22943, SIGTERM) = 0
22939: kill(22943, SIGCONT) = 0
22939: open("supervise/status.new", O_WRONLY|O_NDELAY|O_CREAT|O_TRUNC, 0644) = 9
22939: write(9, " @\0\0\0 @ =B0E117C801 L".., 18) = 18
22939: close(9) = 0
22939: rename("supervise/status.new", "supervise/status") = 0
22939: sigprocmask(SIG_UNBLOCK, 0xFFBEFC90, 0x00000000) = 0
22939: Received signal #18, SIGCLD [caught]
22939: siginfo: SIGCLD CLD_KILLED pid=22943 status=0x000F
22939: write(4, "\0", 1) = 1
22939: setcontext(0xFFBEF978)
22939: poll(0xFFBEFD38, 2, 1000020) = 2
22939: sigprocmask(SIG_BLOCK, 0xFFBEFC90, 0x00000000) = 0
22939: read(3, "\0", 1) = 1
22939: read(3, 0xFFBEFD17, 1) Err#11 EAGAIN
22939: waitid(P_ALL, 0, 0xFFBEFC30, WEXITED|WTRAPPED|WNOHANG) = 0
22939: open("supervise/status.new", O_WRONLY|O_NDELAY|O_CREAT|O_TRUNC, 0644) = 9
22939: write(9, " @\0\0\0 @ =B2B218 59414".., 18) = 18
22939: close(9) = 0
22939: rename("supervise/status.new", "supervise/status") = 0
22939: read(6, " x", 1) = 1
22939: open("supervise/status.new", O_WRONLY|O_NDELAY|O_CREAT|O_TRUNC, 0644) = 9
22939: write(9, " @\0\0\0 @ =B2B218 59414".., 18) = 18
22939: close(9) = 0
22939: rename("supervise/status.new", "supervise/status") = 0
22939: open("supervise/status.new", O_WRONLY|O_NDELAY|O_CREAT|O_TRUNC, 0644) = 9
22939: write(9, " @\0\0\0 @ =B2B218 59414".., 18) = 18
22939: close(9) = 0
22939: rename("supervise/status.new", "supervise/status") = 0
22939: _exit(0)
master#

Here is the same moment as seen by svstat:

master# while true; do svstat /service/dnscache; sleep 2; done
(...)
/service/dnscache: up (pid 22943) 461 seconds
/service/dnscache: up (pid 22943) 463 seconds
/service/dnscache: up (pid 22943) 465 seconds, want down
/service/dnscache: supervise not running
/service/dnscache: supervise not running
/service/dnscache: up (pid 23524) 1 seconds
/service/dnscache: up (pid 23524) 3 seconds
^C
master#

Why these 'kill' calls?

b.

Mirko Steiner

2004-02-26 10:26:46 UTC

What happens if you run the server in your shell without daemontools?
Take a look at the ``run'' script of your service to see how the service
is started. Does it run normally then?

Mirko

Bastien Devos

2004-02-26 12:02:09 UTC

Post by Mirko Steiner
Take a look at the ``run'' script of your service to see how the service
is started. Does it run normally then?

Here is my run script for dnscache :

#!/bin/sh
exec 2>&1
exec <seed
exec envdir ./env sh -c '
exec envuidgid Gdnscache softlimit -o250 -d "$DATALIMIT" /usr/local/bin/dnscache
'

Paul Jarc

2004-02-26 14:39:51 UTC

Post by Mirko Steiner
What happens if you run the server in your shell without daemontools?
Take a look at the ``run'' script of your service to see how the service
is started. Does it run normally then?

This is a solved problem. Solaris sh is buggy, and to avoid
triggering the bug, you need to add " < /dev/null > /dev/msglog 2>&1"
to the end of the inittab line. Check the archives if you want the
full story.
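
Applied to the inittab entry quoted elsewhere in this thread, the change would look like the output of the sketch below. It only transforms a line of text on stdin and does not touch /etc/inittab itself. fix_inittab_line is a hypothetical helper name used for illustration.

```shell
#!/bin/sh
# fix_inittab_line: append the redirections to an svscanboot inittab
# entry read from stdin. Lines that do not end in "svscanboot" (including
# lines that already carry the redirections) pass through unchanged.
# Hypothetical helper for illustration only.
fix_inittab_line() {
  sed -e '/svscanboot$/s|$| < /dev/null > /dev/msglog 2>\&1|'
}

# Example with the entry from this thread:
echo 'sv:123456:respawn:/bin/sh /command/svscanboot' | fix_inittab_line
# prints: sv:123456:respawn:/bin/sh /command/svscanboot < /dev/null > /dev/msglog 2>&1
```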

paul

Bastien Devos

2004-02-27 08:41:53 UTC

Adding " < /dev/null > /dev/msglog 2>&1" to the end of the svscanboot line in my
inittab seems to solve the problem!

Thanks everybody, thanks Paul.

Bastien.

Jonathan de Boyne Pollard

2004-02-26 14:39:46 UTC

BD> 22939: read(6, " d", 1) = 1
BD> 22939: read(6, " x", 1) = 1

Somewhere you are running "svc -dx /service/dnscache" at 10 minute intervals.

Your problem is self-inflicted. (-:
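
The bytes that supervise reads from its control pipe in the trace are the same letters as the svc options that produced them. The sketch below maps each letter to its documented svc meaning, which may help when reading such truss output. describe_control_byte is a hypothetical helper name, not a daemontools command.

```shell
#!/bin/sh
# describe_control_byte: print the documented meaning of one svc option
# letter, as seen in the read(6, ...) calls of a supervise trace.
# Hypothetical helper for illustration; not part of daemontools.
describe_control_byte() {
  case "$1" in
    u) echo "up: start the service, restart it if it stops" ;;
    d) echo "down: send TERM then CONT, do not restart" ;;
    o) echo "once: start the service, do not restart it" ;;
    p) echo "pause: send STOP" ;;
    c) echo "continue: send CONT" ;;
    h) echo "hangup: send HUP" ;;
    a) echo "alarm: send ALRM" ;;
    i) echo "interrupt: send INT" ;;
    t) echo "terminate: send TERM" ;;
    k) echo "kill: send KILL" ;;
    x) echo "exit: supervise exits once the service is down" ;;
    *) echo "unknown control byte: $1" ;;
  esac
}
```

Read this way, the "d" in the trace accounts for the SIGTERM and SIGCONT kill calls, and the "x" for the _exit(0) that follows.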

Bastien Devos

2004-02-26 09:57:17 UTC

Post by Paul Jarc
<URL:http://marc.theaimsgroup.com/?l=djbdns&m=105234961406943&w=2>

Here is the last line of /etc/inittab on the Solaris box:

sv:123456:respawn:/bin/sh /command/svscanboot

Do I have to change it?

Paul Jarc

2004-02-26 14:36:26 UTC

Post by Bastien Devos

Post by Paul Jarc
<URL:http://marc.theaimsgroup.com/?l=djbdns&m=105234961406943&w=2>

sv:123456:respawn:/bin/sh /command/svscanboot
do I have to change it ?