



Problem You faced discrete optimization problem which is obviously NPhard. In such a problem there is a big space of feasible solutions and a goal function. You are required to find a solution with the value of goal function as least as possible.
Solution Simulated Annealing is one of the metaheuristics for the global optimization problems. To use simulated annealing you have to define space of feasible solutions, goal function and solution mutation.
(Until the discussion section all the examples will refer to the traveling salesman problem.)
Space of feasible solutions is a set of all the possible solutions to the problem. For example, feasible solutions of TSP are all the hamiltonian cycles in the full graph. Another space of solutions is a set of all permutation of vertices. Each of these permutations represents an order of vertex traversal. Strictly speaking, these two solution spaces are different, but for us they are almost the same. It is much more convenient to use the second representation. In our program we should define the class of solution (like "Sol").
Goal function is a function defined on the set of solutions giving the score for each solution. The solution score is usually integer of float value and means how good the solution is. We will consider minimization problem, so: less score = better solution. Very often goal function can be taken straight from the problem statement. From the programmer's perspective it is necessary to create a method "GetScore" in "Sol" class.
In this recipe we will refer to mutation in two ways: 1. Mutation sample is a particular perturbation of solution which preserves its feasibility. 2. Mutation procedure is a way to generate random mutation sample for a given solution. The example of mutation procedure for TSP problem is: choose two random vertices in a tour and swap them. There are V*(V1)/2 mutations samples, each of them is fully described by two indices (vertex numbers or indices in permutation to swap) and is produced with equal probability. Note that changing random element of permutation to random vertex is incorrect mutation since the result is not a permutation hence it breaks feasibility. The score of mutation is the score of perturbed solution minus the score of initial solution. It is very important that the space of solutions is connected by mutation which means that any solution can be produced from any other solution in finite number of mutations with positive probability. It is also quite cool that the score of mutation can often be calculated much faster than the score of the solution (for TSP it is O(1) instead of O(V)). Mutation code is the most important part of the whole simulated annealing solution. I will assume that there is a class of mutation samples (like "Mut") which has three methods: 1. generation ("Init" with "Sol" parameter) 2. getting delta = score difference ("GetScore") 3. perturbating the solution ("Apply" with "Sol" parameter)
The greedy optimizational approach is the following: 1. Take any initial solution. 2. Generate mutation for current solution. 3. If the mutated solution is better than the current one, then switch to it: if (score of mutation < 0) solution := mutated solution 4. Unless the stop condition is met, go to step 2. Usually the greedy approach quickly converges to the local minima i.e. it finds the solution which is not globally best but any single mutation makes it worse. Of course, this local minima can be very far from the globally best solution. That's why all the heuristics like simulated annealing where created=)
Simulated annealing works almost the same but it allows to make the solution worse with some probability. The greater the difference is, the less the probability is. Also the temperature parameter is introduced. The temperature characterizes the approximate score of mutations that are accepted with reasonable probability. Initially the temperature is high so that almost any mutation is accepted with probability close to 100%. Then the temperature is being "cooled". Hence the bad mutations are being accepted less and less frequently. At the end the temperature is very small so that almost any bad mutation is almost never accepted. In this case simulated annealing works like the greedy solution and goes to some local minima. Here is pseudocode for simulated annealing: 1. Take any initial solution. 2. Generate mutation for current solution. Let "delta" be its score. 3. If delta < 0, then switch to the mutated solution. Otherwise switch to the mutated solution with probability equal to exp(score / temperature). 4. Unless the stop condition is met, go to step 2. The chance of switching to mutated solution which is used here originates from laws of some physical systems. It is the most commonly used formula to calculate acceptance probability from delta and temperature. The behaviour of temperature parameter during annealing is determined by the socalled cooling schedule. 

Since simulated annealing acts independently from the problem statement, it can be implemented in a general way. Of course you may have to tweak something in a particular marathon... The C++ implementation below is taken from TCO10 round 1 solution for Planarity problem. It has exponential cooling schedule dependent on work time.
double SimulatedAnnealing_globalstarttime = gettime();
template<class Sol, class Mut> void SimulatedAnnealing(Sol &solution, double maxt, double mint, double finishattime) {
//synchronization means calculating current temperature from current work time
static const int ITERS_PER_SYNC = 16; //must be power of two
//sometimes we dump current progress to stderr
static const int ITERS_PER_DUMP = 0x40000; //must be power of two
//type of solution score
typedef double ScoreType;
//number of iterations done so far
int iters = 0;
//part of the full cooling schedule passed
double done = 0.0;
//best solution so far
Sol best = solution;
//number of accepted mutations overall
int accmuts = 0;
//time when the SA begins
double starttime = gettime();
//time dedicated for SA processing
double hastime = finishattime  (starttime  SimulatedAnnealing_globalstarttime);
for (; ; iters++) {
//dump stats so that you can watch the progress of SA
if (!(iters & (ITERS_PER_DUMP  1)))
fprintf(stderr, "Iteration:%6d Acc:%6d Temp:%7.3lf Score:%0.5lf\n"
, iters, accmuts, maxt * pow(mint/maxt, done), solution.GetScore());
//synchronize the temperature with time
if (!(iters & (ITERS_PER_SYNC  1))) {
done = (gettime()  starttime) / hastime;
if (done >= 1.0) break;
}
//create mutation for current solution
Mut mut;
mut.Init(solution);
//get the score delta of the mutation
ScoreType delta = mut.GetScore();
//if mutated solution is better, accept it
bool move = false;
if (delta <= 0) move = true;
else {
//otherwise calculate current temperature
double temp = maxt * pow(mint/maxt, done);
//and accept with the tricky probability
move = trandom() < exp(delta / temp);
}
//if mutation is accepted, apply it to the solution
if (move) {
accmuts++;
mut.Apply(solution);
}
//do not forget to store the best solution
if (solution.GetScore() < best.GetScore()) best = solution;
}
//return the best solution as the result
fprintf(stderr, "Simulated annealing made %d iterations (accepted: %d)\n", iters, accmuts);
solution = best;
}
Discussion
You can find some proofs that under certain conditions simulated annealing converges to global optima in some sense. While they are interesting to mathematicians, the cooling schedule requirements in these proofs are too severe. Moreover, the mathematical convergence is analyzed mostly for infinite simulated annealing process. In practice SA often produces a good solution so we can rely on it.
SA is usually applied to discrete optimization problems but it can also work for continuous problems. In this case the mutation procedure should perform continuous local optimization after the perturbation. Or the mutation scale should decrease in the SA process.
if two solutions of your problem can be merged effectively, you can also try applying Genetic Algorithms. Beware: they are much stranger, require order of magnitude more tweaking, and much more black magic to work properly=)
As it was already noted, mutation procedure is often the most important part of SA solution. Mutation should connect the solution space well. Diameter of solution graph should be relatively small. It means that any solution may be generated from any other in reasonably small number of iterations. For example, in TSP with solution space and mutation described above diameter is only V1 while size is V!. It is also bad if transition probabilities for the paths are too small. Mutation procedure should perform some small perturbation so that the delta is relatively small. For example, the mutation example for TSP gives rather small delta (about 4*W = 4*avg(Wij)) and is good. random_shuffle is a very bad mutation since the delta is too large (about V * W) and transition probability is almost equal to zero (despite diameter equal to 1). Also mutating is the most timeconsuming operation, so it should work fast. For example in TSP problem generation, getting delta and applying of mutation can be done in O(1) time. Sometimes it is worthwhile to change mutation during the simulated annealing process. The skill of choosing good mutation is mostly practical. But reading cool books/articles may sometimes help you.
The cooling schedule is also important. It defines the balance between doing completely random mutations (T = inf) and greedy approach (T = 0). Usually the temperature is decreasing, but some researchers tried schedules which allowed increase of temperature. The cooling schedule may also be dependent on the behaviour of the solution. But the simplest approach is to use fixed decreasing cooling schedule. The initial(maximal) temperature should be large enough to make all mutations acceptable with good probability. For example you can try to reach 80% acceptance rate for bad mutations at the maximal temperature. Or you can simply calculate the average delta for mutation and take maximal temperature proportional to it. For example in TSP problem you may set initial temperature equal to 5*W (W = avg(Wij), average edge distance/weight). The terminal(minimal) temperature should be less than the minimal possible positive mutation delta (especially if the score is integer) or less than the precision you want to get. The middle part of cooling schedule can be chosen exponential. In this case the formula is: temp = maxt * pow(mint/maxt, i/N) where maxt, mint are initial and terminal temperatures, i is the number of current iteration, N is the number of iterations total. You can also synchronize the cooling schedule with time limit (look the code above).
The last thing to mention is the initial solution. Since the solution graph is required to be well connected, you may choose the random feasible solution. Also you can apply some other fast heuristic to get initial solution. For example in TSP you can take a nearest neighbor tour. 

Now let's provide more examples of simulated annealing applications. First of all, here is the C++ code for TSP example:
//global variables and constants
const int SIZE = 256;
static const double TIMELIMIT = 9.5;
int n;
double matr[SIZE][SIZE];
struct Solution {
int perm[SIZE];
double score;
void CalcScore() {
//this function is not used in SA
score = 0.0;
for (int i = 0; i<n1; i++) score += matr[perm[i]][perm[i+1]];
score += matr[perm[n1]][perm[0]];
}
double GetScore() { return score; }
};
struct Mutation {
int i, j;
double score;
static int norm(int k) {
if (k < 0) k += n;
if (k >= n) k = n;
return k;
}
void Init(const Solution &s) {
i = rand() % n;
j = rand() % n;
int pi = norm(i1);
int ni = norm(i+1);
int pj = norm(j1);
int nj = norm(j+1);
score = 0;
int *p = (int*)s.perm; //ugly constcast
score = matr[p[pi]][p[i]];
score = matr[p[i]][p[ni]];
score = matr[p[pj]][p[j]];
score = matr[p[j]][p[nj]];
swap(p[i], p[j]);
score += matr[p[pi]][p[i]];
score += matr[p[i]][p[ni]];
score += matr[p[pj]][p[j]];
score += matr[p[j]][p[nj]];
swap(p[i], p[j]);
}
double GetScore() { return score; }
void Apply(Solution &s) {
s.score += score;
swap(s.perm[i], s.perm[j]);
}
};
//getting initial solution with NN
Solution NearestNeighborTour() {
Solution res;
bool used[SIZE] = {false};
used[0] = true;
res.perm[0] = 0;
for (int i = 1; i<n; i++) {
int last = res.perm[i1];
int best = 1;
double minw = 1e+100;
for (int j = 0; j<n; j++) if (!used[j]) {
double tw = matr[last][j];
if (minw > tw) {
minw = tw;
best = j;
}
}
res.perm[i] = best;
used[best] = true;
}
res.CalcScore();
return res;
}
...
//calculating average edge weight
double avgweight = 0.0;
for (int i = 0; i<n; i++)
for (int j = 0; j<n; j++) if (i != j)
avgweight += matr[i][j];
avgweight /= n*(n1);
//running SA
double maxt = avgweight * 10.0;
double mint = avgweight * 0.001;
Solution sol = NearestNeighborTour();
SimulatedAnnealing<Solution, Mutation>(sol, maxt, mint, TIMELIMIT);
...
This mutation is fully O(1) in time complexity. There is a better TSP mutation called 2opt which is more commonly used though its application to solution usually takes O(N) time.
The next example is TCO10 round 1 problem Planarity. The simple mutation is obvious: take random vertex and move it to the random place. Be sure not to waste time on full O(E^2) score recalculation on each iteration. Only about O(E^2 / V) edge pairs should be rechecked for intersection. The random solution is enough as a start. Temperature range may be maxt=E .. mint=0.01. You can try to tweak the mutation to produce better results.
In the TCO10 round 2 problem CellularAutomaton SA also could be used. The solution is a set of changes we would like to apply to the given initial grid. It is convenient to store the changes as a XORmatrix of changes. The initial solution is nochange solution. The mutation is performed by inverting the contests of random cell in the starting grid. Since there is a limit on number of cell changes this mutation will quickly make a solution unfeasible. To fix it we have to consider only mutations that eliminate a change if the number of changes is already equal to maximal allowed (K). The mutation can be optimized greatly if: 1. We store the grid on all the iterations in the solution (not only first grid). 2. We handle mutation as a sequence of square xorfields. The cells after ith iteration that are changed are stored in the ith xorfield. It is obvious that the size of ith xorfield never exceeds (2*i+1) since changes propagate at maximum speed: one cell per iteration. By using this optimization we can reduce mutation time from O(KXY) to O(K^3) which is often less.
The CUDA superhero Challenge 2 problem MaintainPlanes is a clever TSP modification and can be solved with SA. Psyho proved it with excellence=) (read here) 

Very well written. I learned a lot, thank you! The comment about genetic algorithms was very funny and matches well with my personal experience.
Can you provide some intuition as to why you would use ITERS_WARMUP? So essentially you are being greedy at the start then nongreedy and then greedy towards the end of your schedule. I guess this gives you a faster start?
Also I didn't understand what you mean by "synchronization"? 

To be honest, I tried to use GA in TCO01 and TCO02 marathons, and it was actually working (i.e. converging somewhere), but it worked worse than even the simplest SA. The cool thing about GA is that its process is not dependent on time (no temperature like in SA). Maybe GA requires more time to produce good results, I don't know. I would be happy if some "marathon jedi" wrote GA recipe which really showed how to make GA work better than SA.
Below you can read why this time synchronization mechanism was invented. Now it comes to my mind that instead of doing all that ugly things I can simply write:
if (!(iters & (ITERS_PER_SYNC  1)))
done = (gettime()  starttime) / hastime;
I'll investigate how it affects the solution and edit the code=)
Synchronization with time means: temperature = F(work time). Note that the function parameter is work time instead of iteration number. So I want SA to take precisely 9.7 seconds on a marathon with 10 seconds TL=) It is quite strange if SA ends in 5 seconds and other 5 are wasted or if it doesn't have time to end properly. It is useless to restart SA from the very beginning since the first iterations effectively randomize the solution (temperature is very high). During the first ITERS_WARMUP iterations SA works with maximal temperature (it is not greedy, but almost random). These iterations are required to estimate how much time is consumed per one iteration. Then it is assumed that all iterations spend approximately the same time. So if we did K iterations and spent T time on them, the time limit is L, then we can do about K*(LT)/T more iterations. As you see, there is division on T. That's why this "synchronization" scheme is enabled only after several first iterations. This modification is actually doing the start slower. The first ITERS_WARMUP are almost wasted. But I tweak this constant so that the time spent on warmup is about 0.1  0.5 seconds. So I do not lose much=) But I gain stability in the cooling schedule. By the way, the assumption that all iterations take the same time is often wrong. For example, 2opt TSP mutation takes O(1) time to generate and calculate score, but O(V) time to apply. That's why iterations with big temperature will take significantly more time that iterations with small temperature (mutation acceptance ratio is different). But as I wrote above all this ugliness is completely unnecessary=) Thanks for a question! 

This would be a great standalone recipe on SA, but we have a preceding one on hill climbing (that's what you call "greedy", right?), and this recipe has a lot of good points (like connectivity by mutation of solutions space) that can be applied to HC as well.
I suggest that we split your recipe in two parts: the part which can be applied to both SA and HC I'll add to HC recipe (marking you as a coauthor), and SA recipe will concentrate on the differences between SA and HC, choice of cooling schedule, examples of SA etc. This will make this pair of recipes more balanced. 

OK, I wiped that ugly code from SA and edited the post. Now the temperature is simply determined by current time. 

Yeah, hill climbing is called greedy here=)
The terms and definitions can be unified across hill climbing, simulated annealing and genetic algorithms. But not much more. I think it is a bad idea to eliminate any single tip out of this recipe only because it is also true for hill climbing. For example, all the tips about mutation for SA should be in one place, not separated across several recipes, so it would be bad to move the tip about solution space wellconnectivity out of here (though copy is of course OK).
I cannot delete paragraphs "solution space", "goal function" and "mutation" from "Solution" section since in each of them: 1. Simple example for TSP case is given. I feel that it is important to provide simple examples throughout the whole recipe because otherwise it would be too abstract. 2. I define the programming interface of concept. It means the class/method which you have to write to implement the concept. And the C++ SA code uses precisely these definitions (names of classes and methods and their signatures are completely the same). It is convenient for understanding the code.
The best thing I can do is to replace "greedy approach" paragraph with a single reference to your recipe. And reformat SA pseudocode to make it more similar to your HC pseudocode.
Making terms unified now is rather complicated... Here is a list of major differences in terms: 1. state = (feasible) solution. 2. neighbor/transformation = mutation sample/procedure. 3. maximum problem <> minimum problem We can ignore them since the correspondences are clear. Or edit the recipes to unify them completely.
As you see I'm not very happy about the idea of cutting this recipe... Sorry. Is the standalone style of the recipe really bad? 

If hill climbing = greedy SA, ie. a subset of SA then why not remove the recipe on hill climbing completely?
EDIT: I just had a look at the hill climbing recipe and it looks quite different from this one: provides further examples and insights. So now I am leaning towards keeping both recipes. But you guys must make sure that you use a consistent language. 

Thanks for the explanation.
I could never get SA with cooling to work well. Perhaps the biggest difficulty for me is finding the correct cooling schedule: not too fast and not too slow. Instead my preferred version of SA is something like this:
1. Start with an initial solution
For all possible mutations
2. Perform a mutation
3. If score of the mutated solution > 0.9 * score of current solution: keep the mutation
4. If no single mutation increased your score then revert to the previously best solution
0.9 is just a constant and can be changed to anything from 0.7 to 0.99. Note that you can keep mutations in step 3 without ever increasing the score. Hence step 4 is important, as it ensures that you get back to a good state. 

Well... The algorithm you prefer seems to be modified hill climbing but not simulated annealing.
This is because the number 0.9 is constant. I think cooling is essential for SA. Did you try to make this constant change? Something like C = pow(0.7, 1T) or C = 0.7 + 0.3*T.
Also the deterministic nature of transition condition is not very good. There may be some paths in the solution space that you never consider because the maximal regress is more that x0.9. In this case your algorithm has no chance of jumping over this obstacle. SA is also unlikely to overcome this problem, but it can do it. Though the probability to do so decreases if the maximum barrier in solution space increases. On the other hand, if you increase number of cooling iterations, the probability to pass through the barrier increases. So there is always some hope with nondeterministic transition rule=)
The idea to revert to some previous solution is sometimes used in both SA and HC. I've seen it in wikipedia but never tried to use it.
I've never tweaked cooling schedule yet. Just used the exponential schedule. The maximal/minimal temperature can be determined as it is advised in the recipe (based on the scale of mutation score).
Of course I cannot say that your approach is worse than simple SA unless we compare them on some problem. Maybe it is really better?... 
