About keysort/2 and msort/2...

Discussion:

Fabien Todescato

2003-03-07 23:02:06 UTC

Dear SWI-Prologers,

Do you know if keysort/2 and msort/2 are stable sorting predicates ?
That is, do these predicates keep the relative order of elements which
compares equal for the underlying sorting ordering ?

I need a stable keysort algorithm and would rather use the built-in
keysort/2 than reinventing the wheel.

Best regards, Fabien Todescato

----------------
* To UNSUBSCRIBE, please use the HTML form at

http://www.swi-prolog.org/mailinglist.html

or send mail to prolog-***@swi.psy.uva.nl using the Subject: "unsubscribe"
(without the quotes) and *no* message body.

** An ARCHIVE of this list is maintained at

http://www.swi.psy.uva.nl/projects/SWI-Prolog/mailinglist/archive/

Richard A. O'Keefe

2003-03-10 03:11:08 UTC

Permalink

Do you know if keysort/2 and msort/2 are stable sorting predicates ?

Any Prolog where keysort/2 is NOT stable is badly broken.

You will find keysort/2 in boot/sort.pl
Yes, it is stable.

The name `msort/2' originally signified MERGE sort and was
definitely intended to be a stable sort.

You will find sort/2 and msort/2 in src/pl-list.c
No, msort/2 is NOT stable.
But you can't possibly tell, so it doesn't matter.

I note that pl-list.c defines sort/2 and msort/2 in terms of C's qsort()
procedure. This is *NOT* a good idea. I've known commercial C
compilers implement qsort() as Shell sort or worse (yes, really) and the
O(n**2) worst case behaviour of so-called "quick sort" is no mere
theoretical possibility. I'm tired of people quoting textbooks at me to
the effect that with random inputs the O(n**2) worst case is extremely
unlikely; REAL data sets are NOT random (unless you've taken special
care to randomise them) and O(n**2) run times are quite commonly
observed. There are a number of tricks one can play to reduce the
likelihood, median-of-medians pivot selection, fat pivot, but it's all
so complicated and so pointless given the higher efficiency of merge
based sorts.
[Hint: what's _bad_ about having qsort_compare_standard()?]
[Hint: for what inputs is David Warren's sort/2 in Prolog O(N),
and which well known sorting algorithm is likely to go O(N**2)
on such inputs?]

David Warren's classic sort/2 goes like this:

sort(List, Sorted) :-
length(List, Length),
sort_(Length, List, _, Sorted).

sort_(N, L0, L, S) :-
( N > 2 ->
A is N >> 1,
Z is N - A,
sort_(A, L0, L1, S0),
sort_(Z, L1, L, S1),
ord_union(S0, S1, S)
; N =:= 2 ->
L0 = [X1,X2|L],
compare(R, X1, X2),
sort2(R, X1, X2, S)
; N =:= 1 ->
L0 = [X1|L],
S = [X1]
; N =:= 0,
L0 = L,
S = []
).

sort2(<, X1, X2, [X1,X2]).
sort2(>, X1, X2, [X2,X1]).
sort2(=, X1, _, [X1]).

ord_union([], S2, S2).
ord_union([X1|S1], S2, S) :-
ord_union1(S2, X1, S, S1).

ord_union1([], X1, [X1|S1], S1).
ord_union1([X2|S2], X1, S, S1) :-
compare(R, X1, X2),
ord_union12(R, X1, S, S1, X2, S2).

ord_union12(<, X1, [X1|S], S1, X2, S2) :-
ord_union1(S1, X2, S, S2).
ord_union12(>, X1, [X2|S], S1, X2, S2) :-
ord_union1(S2, X1, S, S1).
ord_union12(=, X1, [X1|S], S1, _, S2) :-
ord_union(S1, S2, S).

Changing this to msort/2 is fairly obvious, we just do merging
instead of unioning.

msort(List, Sorted) :-
length(List, Length),
msort_(Length, List, _, Sorted).

msort_(N, L0, L, S) :-
( N > 2 ->
A is N >> 1,
Z is N - A,
msort_(A, L0, L1, S0),
msort_(Z, L1, L, S1),
ord_merge(S0, S1, S)
; N =:= 2 ->
L0 = [X1,X2|L],
compare(R, X1, X2),
msort2(R, X1, X2, S)
; N =:= 1 ->
L0 = [X1|L],
S = [X1]
; N =:= 0,
L0 = L,
S = []
).

msort2(<, X1, X2, [X1,X2]).
msort2(>, X1, X2, [X2,X1]).
msort2(=, X1, X2, [X1,X2]).

ord_merge([], S2, S2).
ord_merge([X1|S1], S2, S) :-
ord_merge1(S2, X1, S, S1).

ord_merge1([], X1, [X1|S1], S1).
ord_merge1([X2|S2], X1, S, S1) :-
compare(R, X1, X2),
ord_merge12(R, X1, S, S1, X2, S2).

ord_merge12(<, X1, [X1|S], S1, X2, S2) :-
ord_merge2(S1, X2, S, S2).
ord_merge12(>, X1, [X2|S], S1, X2, S2) :-
ord_merge1(S2, X1, S, S1).
ord_merge12(=, X1, [X1|S], S1, X2, S2) :-
ord_merge2(S1, X2, S, S2).

ord_merge2([], X2, [X2|S2], S2).
ord_merge2([X1|S1], X2, S, S2) :-
compare(R, X1, X2),
ord_merge12(R, X1, S, S1, X2, S2).

Keysort is nearly the same as msort, except for how the comparisons are done.
(There's no reason keycompare/3 couldn't be a built in predicate.)

keysort(List, Sorted) :-
length(List, Length),
keysort_(Length, List, _, Sorted).

keysort_(N, L0, L, S) :-
( N > 2 ->
A is N >> 1,
Z is N - A,
keysort_(A, L0, L1, S0),
keysort_(Z, L1, L, S1),
key_merge(S0, S1, S)
; N =:= 2 ->
L0 = [X1,X2|L],
key_compare(R, X1, X2),
msort2(R, X1, X2, S)
; N =:= 1 ->
L0 = [X1|L],
S = [X1]
; N =:= 0,
L0 = L,
S = []
).

key_compare(R, K1-_, K2-_) :-
compare(R, K1, K2).

key_merge([], S2, S2).
key_merge([X1|S1], S2, S) :-
key_merge1(S2, X1, S, S1).

key_merge1([], X1, [X1|S1], S1).
key_merge1([X2|S2], X1, S, S1) :-
compare(R, X1, X2),
key_merge12(R, X1, S, S1, X2, S2).

key_merge12(<, X1, [X1|S], S1, X2, S2) :-
key_merge2(S1, X2, S, S2).
key_merge12(>, X1, [X2|S], S1, X2, S2) :-
key_merge1(S2, X1, S, S1).
key_merge12(=, X1, [X1|S], S1, X2, S2) :-
key_merge2(S1, X2, S, S2).

key_merge2([], X2, [X2|S2], S2).
key_merge2([X1|S1], X2, S, S2) :-
key_compare(R, X1, X2),
key_merge12(R, X1, S, S1, X2, S2).

I tested this code in Quintus Prolog and SWI Prolog (renaming the
top-level predicates to avoid clashes with built-ins). For a timing
test, I selected 10000 words at random from /usr/dict/words, and did
findall(W, w(W), Ws),
statistics(runtime, _),
sort(Ws, _),
statistics(runtime, [_,T|_])
with the obvious changes for the other predicates.

To my surprise, the version of sort/2 above was 12% faster than the
one in Quintus Prolog, which I _thought_ was pretty much exactly this code.
Also to my surprise, the built-in keysort/2 was *faster* than the built-in
sort/2 in Quintus Prolog. Maybe they gave it special treatment after I left,
as it is such a VERY useful predicate.

I got two other surprises in SWI Prolog version 5.0.8.
First, sort/2 as shown above is INFINITELY faster than the built-in
(C coded) sort/2, because when I run the timing code shown above, I get
ERROR: sort/2: Caught signal 11 (segv)
Now I can understand sort/2 *failing* if the input is not a list.
But how can sort/2 possibly get a SIGSEGV exception?

Second, keysort/2 as shown above is more than twice as fast on this
test set as the built-in one. I was getting times like 8.6 seconds
with the built-in and 3.75 seconds with the keysort/2 code above.
Can it be that the boot/*.pl files are compiled with 'optimise' false?

A good way to implement these sorting methods via a hybrid of Prolog and C
is
copy_for_sort([], [], N, N).
copy_for_sort([H|T], [H|C], N0, N) :-
N1 is 1 + N0,
copy_for_sort(T, C, N1, N).

sort(List, Sorted) :-
copy_for_sort(List, Working, 0, N),
$sort(Working, N, 1, 0),
Sorted = Working.

msort(List, Sorted) :-
copy_for_sort(List, Working, 0, N),
$sort(Working, N, 0, 0),
Sorted = Working.

keysort(List, Sorted) :-
copy_for_sort(List, Working, 0, N),
$sort(Working, N, 0, 1),
Sorted = Working.

ukeysort(List, Sorted) :-
copy_for_sort(List, Working, 0, N),
$sort(Working, N, 1, 1),
Sorted = Working.

Here $sort(List, Length, Discard-Duplicate-Keys?, Key-Is-First-Argument?)
sorts List by destructively redirecting "tail" pointers. Samsort is a good
sorting method here, it requires O(lgN) workspace, which means that a quite
small array allocated on the C stack is quite good enough, and it causes NO
memory management activity whatever in C code.

I'm afraid that I don't understand the C internals of SWI Prolog well enough
to implement this myself, but if anyone's interested I'll be happy to
explain further.

----------------
* To UNSUBSCRIBE, please use the HTML form at

http://www.swi-prolog.org/mailinglist.html

or send mail to prolog-***@swi.psy.uva.nl using the Subject: "unsubscribe"
(without the quotes) and *no* message body.

** An ARCHIVE of this list is maintained at

http://www.swi.psy.uva.nl/projects/SWI-Prolog/mailinglist/archive/

Jan Wielemaker

2003-03-10 10:44:34 UTC

Permalink

Richard,

Interesting. I did some tests you can find at the TWiki page (bit too
big to send in all source-code.

http://gollem.swi.psy.uva.nl/twiki/pl/pub/Development/PrologSort/sort_test.pl

Post by Richard A. O'Keefe
I got two other surprises in SWI Prolog version 5.0.8.
First, sort/2 as shown above is INFINITELY faster than the built-in
(C coded) sort/2, because when I run the timing code shown above, I
get ERROR: sort/2: Caught signal 11 (segv)
Now I can understand sort/2 *failing* if the input is not a list.
But how can sort/2 possibly get a SIGSEGV exception?

Possibly you run on a machine for which memory mapped stacks
don't work (there was an error in the configure routine that prevented
the tests to work correctly on some Solaris versions; this is fixed in
5.1.9). I also fixed the stack-shifter based version to call the
stack-shifter from sort/2, which makes sort happily sorting 50,000 words
in versions using the stack-shifter. This once worked before,
but as the stack-shifter isn't used on many systems bugs more
often pass unnoticed.

Now back to the real stuff :-) This is my result sorting (the same)
50,000 words from /usr/share/dict/words. I used 50,000 rather
than 10,000 as my machine is too fast to get sensible results
on 10,000:

Test executed on pl-5.1.9 (multi-threaded), Linux 2.4, gcc 3.2,
dual AMD 1600+.

% pl -sort_test.pl
<banner>

1 ?- test.
sort 0.25
msort 0.23
keysort 4.90
ok_sort 1.73
ok_msort 1.80
ok_keysort 1.96

I guess the test-names are `self-explanatory'.

Of course we depend on the quality of the C-library (in this case
glibc 2.2.5), but is seems worthwile doing the job in C for the
time being.

It also seems worthwhile to either make keysort use the same
route, or at least replace it with your version. Am I free to use
your code in boot/sort.pl?

--- Jan

P.s. You can't write destructive assignment through the formal
C-interface. Only if you are prepared to use the internal
interface. setarg/3 tells you how to do this :-)

----------------
* To UNSUBSCRIBE, please use the HTML form at

http://www.swi-prolog.org/mailinglist.html

or send mail to prolog-***@swi.psy.uva.nl using the Subject: "unsubscribe"
(without the quotes) and *no* message body.

** An ARCHIVE of this list is maintained at

http://www.swi.psy.uva.nl/projects/SWI-Prolog/mailinglist/archive/

Alan Baljeu

2003-03-10 15:25:59 UTC

Permalink

I have been unable to get string to do anything useful. It always returns no, whether
Term is ground or not, including for "aeuo", 'aoeua uaueo ea', 12345, and hello.

Is there a use for this predicate? Is there any predicate that returns
atomic(X), \+ number(X)
?

----------------
* To UNSUBSCRIBE, please use the HTML form at

http://www.swi-prolog.org/mailinglist.html

or send mail to prolog-***@swi.psy.uva.nl using the Subject: "unsubscribe"
(without the quotes) and *no* message body.

** An ARCHIVE of this list is maintained at

http://www.swi.psy.uva.nl/projects/SWI-Prolog/mailinglist/archive/

Jan Wielemaker

2003-03-10 20:06:05 UTC

Permalink

Alan,

Post by Alan Baljeu
I have been unable to get string to do anything useful. It always returns no, whether
Term is ground or not, including for "aeuo", 'aoeua uaueo ea', 12345, and hello.

string_to_list(S, "hello"), string(S).

S = "hello"

See the section on strings. Strings normally have no syntactic
representation. They are mostly there for things that didn't get into
the ISO standard. They are sometimes useful as efficient byte-array
representation, especially to foreign code.

Post by Alan Baljeu
Is there a use for this predicate? Is there any predicate that returns
atomic(X), \+ number(X)

atomic(X) :-
( atom(X)
; integer(X)
; float(X)
; string(X)
).

number(X) :-
( integer(X)
; float(X)
).

Cheers --- Jan

----------------
* To UNSUBSCRIBE, please use the HTML form at

http://www.swi-prolog.org/mailinglist.html

or send mail to prolog-***@swi.psy.uva.nl using the Subject: "unsubscribe"
(without the quotes) and *no* message body.

** An ARCHIVE of this list is maintained at

http://www.swi.psy.uva.nl/projects/SWI-Prolog/mailinglist/archive/