MatPlus.Net Forum - General - Creation of benchmark(s) for software solving orthodox moremovers
(1) Posted by Marcin Sterkowiec [Tuesday, Sep 17, 2024 11:50]

I was thinking a bit about the suggestion by Hauke Reddmann to create benchmark(s) for software solving orthodox moremovers (in this thread: https://matplus.net/start.php?px=1726562371&app=forum&act=posts&fid=gen&tid=3191 ):
"I would extend the suggestion to compare:
Compare *all* problem solving engines (might include databases, might include
a standard LiChess Stockfish engine) for various aspects. There is the classic
Alybadix, there is the completely unknown MateMaster (which I still use
for composing #2, it's a bit easier than Popeye/Olive) etc..."
After some consideration it seems quite a large task, involving a few methodological and practical issues. Here are my thoughts:
As suggested by Hauke, we should try to test various aspects separately, for example:
1) Test speed (in the brute-force sense) on a random set of #4-#9 compositions (#4-#6: all solutions; #7-#9: single solution; maybe also: #4/#7 with more pieces, #6/#9 with fewer pieces on the board)
2) Test speed/cleverness on a random set of #15-#30 (this would have to be a somewhat larger set than in point (1), to avoid "random cleverness")
3) Test "magic capabilities" on a set of #31+
The main methodological issue is the selection of problems: I think it has to be a random selection - but from which set?
(Randomness should help avoid issues with testing on problems for which programs were fine-tuned - I think every program has such a set.)
I think that to begin with we should select from "sound" problems (no duals, no multiple solutions); once that is done, we might add more complexity.
Practical issues I can see:
a) Every program has some prerequisites (beyond installation alone). For example, J.G.Island has a set of 5-piece tablebases that has to be downloaded separately (~4 GB). As far as I know, Gustav may require manual setting of analysis parameters.
b) The benchmark should be run by someone who has access to the full/commercial versions of all the programs in the benchmark.

(2) Posted by Marcin Sterkowiec [Thursday, Sep 19, 2024 20:35]

I have created a procedure for the first test (testing speed in the brute-force sense).
The procedure and first tentative results (comparison of J.G.Island vs Popeye) can be found here: https://jgisland.pl/benchmark/procedure.txt
(Of course single-threaded Popeye, which focuses on versatility, couldn't be expected to be faster - although in the case of the first problem, a #4, it was in fact slightly faster.)
My question is whether such a procedure (with 10 sets of problems, https://jgisland.pl/benchmark/sets.zip , and a Python script rand.py, https://jgisland.pl/benchmark/rand.py.txt , that randomly draws 10 compositions from these sets) can generally be accepted as sufficient. If so, we might try to test a broader range of programs - and later also run the second version of the test (speed/cleverness on new sets of #15-#30).
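A minimal sketch of such a drawing script, assuming each set is a plain-text file with one problem per line (the file names and the fixed seed here are illustrative assumptions, not necessarily what the actual rand.py does):

import random

# Illustrative file names - the real sets ship in sets.zip.
SET_FILES = [f"set{i:02d}.txt" for i in range(1, 11)]

def draw(path, count=10, seed=2024):
    # Read one problem per non-empty line.
    with open(path, encoding="utf-8") as f:
        problems = [line.strip() for line in f if line.strip()]
    # A fixed seed makes the draw reproducible for everyone.
    return random.Random(seed).sample(problems, count)

if __name__ == "__main__":
    for path in SET_FILES:
        print(f"--- {path} ---")
        for problem in draw(path):
            print(problem)

A fixed, published seed keeps the selection random with respect to any single program while still letting everyone verify the same draw.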
(3) Posted by Viktoras Paliulionis [Thursday, Sep 19, 2024 20:41]

I think the test set should be representative enough to include different styles of problems with a variety of motifs, for example:
1) Flight giving,
2) Underpromotion,
3) Zugzwang, tempo maneuvers,
4) Anticipatory white sacrifice.
The main purpose of a chess composer's tool should be to search for cooks and duals. The program should show all variations and duals. If a program does not have such capabilities, then it is not very useful for testing chess problems; perhaps such programs should not even be included in the comparison.

(4) Posted by Marcin Sterkowiec [Friday, Sep 20, 2024 00:07]

Yes, for sure this procedure can be made much better - just as you suggested, I have added five additional sets of problems to cover the motifs you wrote about: see https://jgisland.pl/benchmark/procedureUpdated.txt and https://jgisland.pl/benchmark/setsExtended.zip
My intention is to create a simple procedure that is just good enough - one that might generally be considered to provide unbiased results (biased towards none of the programs).
Some of the problems within this random test suite (#7 and longer) are intended to be solved in single-solution mode (MaxSolutions 1) in order to keep the overall time of the test within bounds.
As far as the second issue you mentioned is concerned, I think the main goal is always to find the solution (and duals, if any) in the first move.
In my opinion, showing all variations of moremovers (particularly very long ones) may not be that valuable (too much information at once).
So from my perspective it is enough if this information is accessible on demand after the first/main search. Note that J.G.Island has an option to save all this information (called a "temporary register") to a disk file.
You can set a limit on the size of the file, and you can share the file. It is a powerful option - in my opinion much more powerful than just displaying a very, very long multiline text.
But indeed, J.G.Island (in its first/main run) does not look for duals in moves other than the first one. At this stage of the analysis the program assumes that overall speed matters much more to the user than details of all the possible late duals (if they exist and are needed, they can be found later, during "interactive" analysis - usually a very fast operation, since it uses all the gathered data in the temporary register).
Technically, you first make a black move/response (or wait for the best defence suggested by the program) and then right-click the 'Start' button to force finding all the solutions for White in the second or later moves (left-clicking would mean searching for a single solution only). You can also find this information in the tooltip of the Start button (alternatively, use the items in the "Problem" menu).

(5) Posted by Viktoras Paliulionis [Friday, Sep 20, 2024 10:23]

To ensure unbiased results, all programs being compared must produce the same output. If one program does not search for duals and another does, then it is clear that the one doing less work is likely to be significantly faster. As far as I know, neither Popeye nor WinChloe has an option to skip the search for duals, only an option to refuse to search for short variations.

(6) Posted by Joost de Heer [Friday, Sep 20, 2024 11:20]

For the Popeye output: I am in favour of an 'ignoreDualisticBlackDefenses' option which won't print black moves that have multiple white replies. That should clean up the output significantly, and if you want to see details of why a certain defence is not printed, you can always test the intermediate position.

(7) Posted by Viktoras Paliulionis [Friday, Sep 20, 2024 13:55]

Such an option would clean up the output but would not significantly reduce the solving time, since the program would still have to look for the duals.

(8) Posted by Marcin Sterkowiec [Friday, Sep 20, 2024 17:55]

In my opinion we should simply run all the compared programs with all their available options for speed. Of course every engine has its specifics; it's obvious that it should not be evaluated by this single criterion (speed on orthodox moremovers), but a benchmark for speed is still a... benchmark for speed. The comparison table could also have a clearly visible column stating which additional features are included within the solving time (e.g. "search for late duals included [green] / not included [red]"). That's how I see it... Probably we should wait for more feedback, including from Hauke, who is probably the most interested party (this thread can be considered his initiative).

(9) Posted by Viktoras Paliulionis [Friday, Sep 20, 2024 22:58]

I did a test solving one chess problem with different programs.
#7: 4n3/4K3/3BpNp1/1R1bP3/pPPk1pr1/2R3pQ/2PPr2n/q4Bb1
---------- MaxSolutions 1 ----------
1. Gustav: 0.13 s (+all duals, automatic parameters, 8MB)
2. WinChloe: ≈3 s (4 threads), ≈4 s (2 threads), ≈10 s (1 thread) (+dual searching, 1GB)
3. J.G. Island: 9.92 s (4 threads), 16.98 s (2 threads), > 30 s (1 thread) (no dual searching, default parameters, 1GB)
4. Popeye: 451.12 s (+dual searching, 1GB)
---------- All solutions ----------
1. Gustav: 0.20 s (+dual searching, automatic parameters, 8MB)
2. WinChloe: 6.81 s (4 threads), 11.32 s (2 threads), 18.47 s (1 thread) (+dual searching, 1GB)
3. J.G. Island: > 30 s (8 threads) (no dual searching, default parameters, 1GB)
4. Popeye: 1207.76 s (+dual searching, 1GB)
WinChloe has a parameter "Stop solving if >= ... solutions", but it doesn't seem to work, so the time given for 1 solution is approximate.
The free version of J.G. Island has a time limit (30 s), so some times are unknown.
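A sketch of how one such measurement could be made reproducible for a command-line solver such as Popeye. The input keywords below are written from memory of Popeye's input conventions and the executable path is a placeholder, so treat both as assumptions to verify against the Popeye manual:

import subprocess
import tempfile
import time

# Board part of the FEN for the #7 above.
BOARD = "4n3/4K3/3BpNp1/1R1bP3/pPPk1pr1/2R3pQ/2PPr2n/q4Bb1"

# Popeye-style input (keywords from memory - verify them).
POPEYE_INPUT = f"""BeginProblem
Forsyth {BOARD}
Stipulation #7
Option MaxSolutions 1
EndProblem
"""

def timed_solve(solver_path, input_text):
    # Run a solver once on a temporary input file;
    # return (elapsed seconds, solver output).
    with tempfile.NamedTemporaryFile("w", suffix=".inp",
                                     delete=False) as f:
        f.write(input_text)
        input_path = f.name
    start = time.perf_counter()
    result = subprocess.run([solver_path, input_path],
                            capture_output=True, text=True)
    return time.perf_counter() - start, result.stdout

if __name__ == "__main__":
    # "popeye" is a placeholder - substitute the real executable path.
    elapsed, output = timed_solve("popeye", POPEYE_INPUT)
    print(f"{elapsed:.2f} s")

Timing the whole process with a wall clock at least puts every program on the same footing, including any start-up and parameter-detection time.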
(10) Posted by Hauke Reddmann [Saturday, Sep 21, 2024 11:45]

Same problem, different solvers:
- MateMaster: Isn't built for that; "2%" after 2 hours or so, so expect a few days.
- LiChess Stockfish: Isn't built for that either, shows "#7" on level 99 in a few minutes.
- Hauke, intelligent mode: Isn't built for that anyway, but plays the key in a second
purely by instinct, then slowly sees why; would need a few minutes to sort out the details :-)

(11) Posted by Marcin Sterkowiec [Saturday, Sep 21, 2024 14:56]

I have a feeling that we have drifted slightly off our path - we took an arbitrary problem (from the 15 drawn), which is not consistent with the procedure intended to provide an unbiased comparison. I hope we do want to reach that goal (i.e. a pretty good benchmark).
Another thing is that, most probably (I may be wrong), Gustav is not solving in this case but taking the solution from its database.
Of course, many solutions - including this one - are available immediately (e.g. by searching for the FEN in the following HTML file: https://jgisland.pl/download/reports/moremovers_978g_vs_972o_byAuthor.html ), but that is rather not the point.
J.G.Island also has an option to show immediately the solution of any previously solved moremover (#7+) - it is available by clicking either the small orange book next to the Start button (if a single solution is available) or the small stack of books (if all solutions are available).
(It's also worth noting that J.G.Island stores information about all solved problems in its log file - by default this log file is located in the folder C:\Program Files\JGIsland_ChessMoremovers)
Last but not least, we should also be aware that CPU parameters can have a significant effect on solution time.
For example, J.G.Island's default thread selection adjusts to the number of processors/cores: from 6 CPUs upwards, the default selection is 4^2 = 16 threads. Such a selection may improve performance significantly.
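To keep runs comparable across machines, a benchmark script could at least record the logical CPU count and pin the thread count explicitly. A small sketch; the rule below only loosely mimics the J.G.Island default described above (reading "4^2" as 16) and is not its actual code:

import os

def pick_threads():
    # Illustrative rule: 16 threads on machines with 6+ logical
    # CPUs, otherwise 4 - loosely modelled on the default above.
    cpus = os.cpu_count() or 1
    return 16 if cpus >= 6 else 4

print(f"logical CPUs: {os.cpu_count()}, threads: {pick_threads()}")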
Thus, we might also consider separate benchmarks for 4-core (older) and 8-core (more modern) CPUs - although this is debatable; in fact I think 8 cores/processors should be preferred (anyone interested in speed would simply choose a fast, modern machine).

(12) Posted by Marcin Sterkowiec [Wednesday, Sep 25, 2024 19:10]

Maybe Olaf Jenkner could clarify this? An option to save and (re)use data from already solved problems would indeed be valuable for users, but for benchmarking it should rather be off - we should assume that we are dealing with a new/unknown problem/position.
The only reasonable exception I can see is endgame tablebases.
Besides endgame tablebases, J.G.Island actually also contains custom tablebases (for positions like http://www.yacpdb.org/?id=89976 or https://pdb.dieschwalbe.de/search.jsp?expression=PROBID='P1200254') - their use during a benchmark test may be debatable. Anyway, I would assume that any kind of tablebase that covers all combinations of movable pieces (where a move/capture of any other piece would prevent mate) should be OK for benchmarking.

(13) Posted by Olaf Jenkner [Wednesday, Sep 25, 2024 23:38]

What has to be clarified?
Gustav does not (re)use any data from already solved problems.
Solved and stored problems are shown immediately on a gray screen.

(14) Posted by Marcin Sterkowiec [Thursday, Sep 26, 2024 10:14]

Thank you, Olaf. Now it's clear.
Could you also confirm that the following solution time (0.2 s for "All solutions") looks correct for a typical machine/CPU (the parameters of the benchmark machine were not specified above by Viktoras)?
#7: 4n3/4K3/3BpNp1/1R1bP3/pPPk1pr1/2R3pQ/2PPr2n/q4Bb1 ( http://www.yacpdb.org/?id=306402 )
---------- All solutions ----------
1. Gustav: 0.20 s (+dual searching, automatic parameters, 8MB)
(I am asking because to me it looks much too fast for a run that does not (re)use any data from previous analyses.)

(15) Posted by Viktoras Paliulionis [Thursday, Sep 26, 2024 19:32]

I used a typical PC with an Intel Core i7-8700 @ 3.20 GHz. Without optimization, the solution time with Gustav (64 MB) is 22.4 s, which is comparable to WinChloe (1 GB, 1 thread) at 18.5 s.
Optimization significantly speeds up solving but does not ensure that all solutions and duals are found.

(16) Posted by Olaf Jenkner [Thursday, Sep 26, 2024 19:40]

Yes, I get nearly the same computation time.
Automatic parameters are designed for quick solving. Sometimes cooks or duals are found with these parameters.
It is not a brute-force test.

(17) Posted by Marcin Sterkowiec [Thursday, Sep 26, 2024 22:55]

I have a few questions/doubts about "automatic parameters" in Gustav:
- Is the time used for finding the automatic parameters included in the solving time?
- You write that "sometimes cooks or duals are found with these parameters" - at first glance the word "sometimes" seems to contradict "All solutions"; I would expect "always" (as far as duals in the first move are concerned, of course, not late duals).
Is there perhaps some brochure or documentation available that explains the concept of "automatic parameters" more thoroughly? I must admit that my initial intuition is that it may be some "hash" of solution data (or a strong hint based on the known solution) and, as such, should rather be excluded from the benchmarking procedure - please correct me in case I am wrong here.

(18) Posted by Olaf Jenkner [Thursday, Sep 26, 2024 23:46]

Of course the solving time is the time elapsed between the click on the solution button and the display of the solution.
"All solutions" are all solutions found with the same parameters like the first solution.
See the excerpt from the German manual:
7. Parameters for the solution search

7.1 All solutions / one solution

In the left-hand part of the main window you set whether Gustav should search for all solutions or stop the calculation after finding the first one. Limiting the search to one solution can save a great deal of time, because otherwise all white first moves are examined before the variations are calculated.
Aborting the calculation in all-solutions mode once the key move has been found means that the remaining moves of the solution are not displayed. In one-solution mode the calculation ends immediately after the first solution move is found. For testing chess problems, of course, all solutions must be searched for. In full-length variations all duals are also determined; however, they are not flagged automatically but must be found by inspecting the solution notation.

7.2 Number of moves unknown

If the number of moves is unknown, or a shorter solution than the one given by the author is suspected, the switch "Number of moves unknown" can be activated. Gustav then calculates the problem first with a move count of 1 and increases it by 1 until the solution is found or the stated move count of the problem is reached.
The advantage is that the calculation is considerably faster if the solution really is found in fewer moves. The disadvantage is a somewhat longer calculation time in cases where the solution is not found before the stated move count. Note also that the calculation ends at the depth at which the solution is found; longer solutions are then not found. Gustav displays the computation time needed for each preceding search depth, which allows a rough estimate of the remaining time.

7.3 Solving parameters

Long chess problems can usually be solved in reasonable time only if not all candidate white moves are examined. The kind of restriction is specified by the solving parameters. For helpmates and helpstalemates Gustav offers no special calculation modes apart from the mating square.

7.3.1 Automatic parameters

Calculating with the automatic mode switched on is the simplest way to find the solution of a chess problem. Gustav determines the parameters on its own so as to obtain the solution as quickly as possible. The parameters with which the solution was found are displayed below the solution field. Note that under the setting "all solutions", further solutions are only found if the automatically determined parameters are sufficient for them. The setting "Number of moves unknown" cannot be combined with automatic parameters.

7.3.2 Brute force

With this setting the problem is calculated without any restrictions, which is why the solving time is longest here. A 100-percent check of correctness is only possible with brute force. Most cooks, however, are already found with relatively strong restrictions.

7.3.3 Custom parameters

A clever choice of these settings can speed up the solution search considerably. Gustav restricts the search tree by examining only white moves that meet certain criteria. There are 8 possibilities:
1. Limiting the flight squares of the black king: If, after the white move, the black king could move to more squares than specified in the "Flight squares" input field, this move is not examined.
2. Considering moves that threaten forced mate: In the "Threat depth" input field you can specify in how many moves White must threaten mate or stalemate. Only sequences in which White gives check continuously count as a threat. The value 0 disables this restriction.
3. Limiting black replies: In the "Maximum number of moves" input field you can specify how many moves Black may have available after a white move. This means that white moves which leave Black too many replies are not examined at all. For this strong restriction a certain number of exceptions can be allowed, and you can specify from which move number to which move number the maximum applies.
4. Specifying the mating square: If it is clear on which square, file or rank the king will be mated, specifying the mating square can speed up the solution search.
5. Specifying the pieces White may promote to: If promotion to certain pieces does not appear sensible, they can be excluded.
6. Specifying the pieces White may capture: If, in a selfmate/selfstalemate, the mate/stalemate is expected to come from a particular piece, you can specify that this piece must not be captured. This avoids needless computation on positions without that piece.
7. Mate possible by a promoted piece: If, in a selfmate, Black only has pawns left and they all stand on the second rank, only a mate by a promoted piece is still possible. Since this is very unlikely, Gustav does not search further in such positions when this parameter is deactivated.
8. Maximum number of white units: This parameter is only available for selfstalemates. It specifies how many units White may still have in the stalemate position. This saves time, because several moves before the end it is already recognized that no such stalemate is possible. If this restriction is not wanted, enter 0.
These 8 ways of restricting the white moves to be examined can be combined freely.
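Section 7.2 above amounts to iterative deepening over the move count. A sketch of that scheme, with solve_exact standing in as a hypothetical one-run-at-fixed-depth callback (not a real Gustav API):

def solve_unknown_length(position, stated_moves, solve_exact):
    # Try #1, #2, ... up to the stated move count, stopping at
    # the first depth where a solution exists.
    for n in range(1, stated_moves + 1):
        solutions = solve_exact(position, n)
        if solutions:
            # Caveat from the manual: the search ends at this
            # depth, so longer solutions are not found.
            return n, solutions
    return None, []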
(19) Posted by Marcin Sterkowiec [Friday, Sep 27, 2024 18:53]

Thank you, Olaf, for this information. It helped me understand the concept of "automatic parameters" in Gustav better (in case you have an English version of the manual, please share that too).
After some reconsideration, I came to the conclusion that... it is all much, much more complicated than I originally expected.
Maybe it will be easier if I start by describing how J.G.Island works: for its analysis it takes three things:
1) position
2) stipulation
3) tablebases
Originally I thought I could simply say that, based on such a tuple, J.G.Island makes its decisions and provides the solution. But actually this is not exactly true: J.G.Island NEVER EVER makes a decision based on the full position/FEN. There is not a single "if (FEN == thisPosition) then..." (well, actually there is - but only in "technical" pieces of code where I want to catch a breakpoint on a particular position while debugging).
All decisions are made from the position's properties/attributes (well, yes, I have a few places in the code commented with "quasi-hardcoding for problem xyz", where the attributes are specified in such a way that very few positions match - probably I was in a hurry and/or tired; I have just found two such places by searching the code). Why did I do it like this? In the hope that some day the code will be well prepared not only for existing problems but also for future ones.
In some of my previous posts (in this thread: https://matplus.net/start.php?px=1727455676&app=forum&act=posts&fid=gen&tid=3191 ) I wrote that J.G.Island's initial analysis of a position is usually simple and fast (<0.2 ms), but in special cases it may take a few seconds. To speed it up and simplify it, I could have a table for such more demanding cases that maps a FEN to optimal analysis parameters (looking up this table would cost almost nothing). Would it be valuable for users? Of course it would: they would potentially save those few seconds when analysing exactly these positions and would have optimal parameters for solving. But I still think that such behaviour should be optional and should be switched off during a benchmarking run. Why? If I have a table of known FEN codes (problems) and the best analysis parameters for them, I am actually taking advantage of more knowledge than the FEN itself - I am taking advantage of knowing the solution for this FEN.
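To make the objection concrete, the kind of lookup being rejected for benchmark runs is essentially this (the FEN key reuses the #7 from this thread; the parameter values are invented for illustration):

# Hypothetical per-problem tuning table: full FEN -> hand-picked
# solving parameters. Cheap to consult, but it encodes knowledge
# gained from having solved that exact problem before - which is
# why it should be switched off during a benchmark run.
TUNED_PARAMS = {
    "4n3/4K3/3BpNp1/1R1bP3/pPPk1pr1/2R3pQ/2PPr2n/q4Bb1": {
        "threads": 16,       # invented values, illustration only
        "threat_depth": 3,
    },
}

def params_for(fen, defaults):
    # Tuned parameters if this exact position is known, else defaults.
    return TUNED_PARAMS.get(fen, defaults)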
I think this approach may be called a sort of "philosophy" - and my "philosophy" is never to make a decision in the program based on a full position/FEN (except for tablebases and positions already solved in the meantime; of course J.G.Island allows exceptions to this rule: there is an option "Always clear temporary data when starting solving", checked by default, but an advanced user can override it and take advantage of a previously solved group of positions).
To sum up:
1) Only one thing seems obvious: judging by what I read in the Gustav manual, the most suitable mode for benchmarking "All solutions" would be the "Brute force" option (with automatic parameters, "All solutions" does not actually search for all solutions...).
2) The rest actually seems to be about "philosophy" - and I think we would have to agree on some common "philosophy" (a common "denominator"?) in order to be able to prepare a good benchmark.