"How long a chain of overlapping movie titles, like Sling Blade Runner, can you find?"

Use the following listing of movie titles: MOVIES.LST. Multi-word overlaps, as in "License to Kill a Mockingbird," are allowed. The same title may not be used more than once in a solution. Heuristic solutions that may not always produce the greatest number of titles will be accepted: seek a reasonable tradeoff of efficiency and optimality.

The problem description itself is a little vague, which leads to different goals depending on how you read it. I chose to measure a chain's length by the number of movie titles used rather than by words or characters.

The first part of the problem is just reading in the data (the data file has its own issues for you to solve). A quick Google search turns up several blogs where the discussion is mostly about reading in the data and running a depth-first search. That finds chains of around 220 titles very quickly and, while nice, doesn't attack the real problem: the algorithm.

Once you have read in the data, at the core you end up with a list of nodes. Each node contains a list of the nodes it points to (and optionally a list of the nodes that point to it). In other words, your challenge, should you accept it, is to find the longest path in a directed graph.
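To make the graph construction concrete, here is a minimal sketch of how the edges might be built. It is not the code from my repository: the names splitWords, joinWords and buildGraph are made up for illustration, and it assumes that titles are compared word by word, ignoring case, and that any non-empty run of words at the end of one title may match any non-empty run at the start of another.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Split a title into lowercase words (assumption: overlaps ignore case).
std::vector<std::string> splitWords(const std::string& title) {
    std::istringstream in(title);
    std::vector<std::string> words;
    for (std::string w; in >> w;) {
        for (char& c : w)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        words.push_back(w);
    }
    return words;
}

// Join words[begin, end) with single spaces.
std::string joinWords(const std::vector<std::string>& words,
                      std::size_t begin, std::size_t end) {
    std::string out;
    for (std::size_t i = begin; i < end; ++i) {
        if (i > begin) out += ' ';
        out += words[i];
    }
    return out;
}

// adjacency[i] lists every title j != i where some word-suffix of title i
// equals some word-prefix of title j.
std::vector<std::vector<int>> buildGraph(const std::vector<std::string>& titles) {
    const int n = static_cast<int>(titles.size());
    std::vector<std::vector<std::string>> words(n);
    std::unordered_map<std::string, std::vector<int>> byPrefix;

    // Index every word-prefix of every title.
    for (int j = 0; j < n; ++j) {
        words[j] = splitWords(titles[j]);
        for (std::size_t len = 1; len <= words[j].size(); ++len)
            byPrefix[joinWords(words[j], 0, len)].push_back(j);
    }

    // Look up every word-suffix of every title in that index.
    std::vector<std::vector<int>> adjacency(n);
    for (int i = 0; i < n; ++i) {
        for (std::size_t start = 0; start < words[i].size(); ++start) {
            auto hit = byPrefix.find(joinWords(words[i], start, words[i].size()));
            if (hit == byPrefix.end()) continue;
            for (int j : hit->second)
                if (j != i) adjacency[i].push_back(j);
        }
        // Two titles can overlap in more than one way; keep one edge per pair.
        std::sort(adjacency[i].begin(), adjacency[i].end());
        adjacency[i].erase(std::unique(adjacency[i].begin(), adjacency[i].end()),
                           adjacency[i].end());
    }
    return adjacency;
}
```

Indexing the prefixes in a hash map avoids comparing every title against every other title, which matters once the listing has thousands of entries.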

My solution is built upon two algorithms. The first algorithm takes a starting node and produces a list of long paths. It won't guarantee the longest path for a source node, but it returns very long paths in a very short amount of time. Using just this algorithm a respectable result can be found. The pseudo-code version of the first algorithm (called findLongChain in the code):

    set1 is populated with a source node
    while (set1 is not empty)
        set2 is emptied
        for each node in set1 all paths are followed.
            If the path to the next node is longer than the
            currently known length to that node, it is
            stored and that node is added to set2.
        set1 = set2
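The pseudo-code above could be sketched in C++ roughly as follows. This is a simplified illustration rather than the code from my repository: the name findLongChainSketch is made up, the graph is assumed to be an adjacency list of integer node ids, and it returns only the single longest path found rather than a list. It also assumes that a step is skipped when the next node is already on the current path, so that a title is never used twice.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // graph[u] = nodes u points to
using Chain = std::vector<int>;

Chain findLongChainSketch(const Graph& graph, int source) {
    assert(source >= 0 && source < static_cast<int>(graph.size()));

    // best[v] is the longest path found so far from source to v (empty = none).
    std::vector<Chain> best(graph.size());
    best[source] = {source};

    std::vector<int> frontier = {source};  // "set1"
    while (!frontier.empty()) {
        std::vector<int> next;             // "set2"
        for (int u : frontier) {
            const Chain& path = best[u];
            for (int v : graph[u]) {
                // Only interesting if it beats the known length to v.
                if (path.size() + 1 <= best[v].size()) continue;
                // Assumption: never revisit a node already on this path.
                if (std::find(path.begin(), path.end(), v) != path.end()) continue;
                best[v] = path;
                best[v].push_back(v);
                if (std::find(next.begin(), next.end(), v) == next.end())
                    next.push_back(v);
            }
        }
        frontier = std::move(next);
    }

    // Return the longest of the per-node best paths.
    return *std::max_element(best.begin(), best.end(),
                             [](const Chain& a, const Chain& b) {
                                 return a.size() < b.size();
                             });
}
```

The loop terminates because a node is only re-queued when its stored path gets strictly longer, and no path can be longer than the number of nodes. It stays a heuristic because each node remembers only one path, and the longest overall chain may need a different one.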

These long paths can then be sent through a second algorithm, which attempts to increase a chain's length by finding longer local paths using a depth-limited search. The pseudo-code version of the second algorithm (called expandChain in the code):

    for (i = 0; i < chain.length; ++i)
        Starting at chain[i], do a depth-first search with a max depth
        of X, where X is a relatively small value, to find a longer
        path than the current chain[i] -> chain[i + X]. If a longer
        path or subpath is found, replace that part of the chain with it.
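One way to read that pseudo-code is sketched below. It is an illustration under stated assumptions, not the code from my repository: expandChainSketch, searchDetour and Detour are made-up names, the graph is an adjacency list of integer ids, and a detour is only allowed to pass through nodes that are not already in the chain before it rejoins the chain at a later position. It does not try to extend the chain past its last node.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

using Graph = std::vector<std::vector<int>>;
using Chain = std::vector<int>;

struct Detour {
    Chain middle;   // off-chain nodes between chain[i] and chain[endIndex]
    int endIndex;   // chain position where the detour rejoins
    int gain;       // how many titles the chain would grow by
};

// Depth-limited DFS from `node`, recording the most profitable detour.
void searchDetour(const Graph& g, const std::unordered_map<int, int>& pos,
                  std::vector<char>& used, int node, int startIndex,
                  int depthLeft, Chain& middle, Detour& best) {
    if (depthLeft == 0) return;
    for (int next : g[node]) {
        auto onChain = pos.find(next);
        if (onChain != pos.end()) {
            // Rejoined the chain: compare detour edges with edges replaced.
            int j = onChain->second;
            int gain = static_cast<int>(middle.size()) + 1 - (j - startIndex);
            if (j > startIndex && gain > best.gain) {
                best.middle = middle;
                best.endIndex = j;
                best.gain = gain;
            }
        } else if (!used[next]) {
            used[next] = 1;
            middle.push_back(next);
            searchDetour(g, pos, used, next, startIndex, depthLeft - 1, middle, best);
            middle.pop_back();
            used[next] = 0;
        }
    }
}

Chain expandChainSketch(const Graph& g, Chain chain, int maxDepth) {
    assert(maxDepth >= 1);
    int i = 0;
    while (i + 1 < static_cast<int>(chain.size())) {
        std::unordered_map<int, int> pos;  // node id -> position in chain
        for (int k = 0; k < static_cast<int>(chain.size()); ++k) pos[chain[k]] = k;

        std::vector<char> used(g.size(), 0);
        Chain middle;
        Detour best{{}, -1, 0};
        searchDetour(g, pos, used, chain[i], i, maxDepth, middle, best);

        if (best.gain > 0) {
            // Swap chain(i, endIndex) for the longer detour, then retry at i.
            chain.erase(chain.begin() + i + 1, chain.begin() + best.endIndex);
            chain.insert(chain.begin() + i + 1, best.middle.begin(), best.middle.end());
        } else {
            ++i;
        }
    }
    return chain;
}
```

Every accepted detour makes the chain strictly longer, so the loop cannot run forever. Keeping maxDepth small is what keeps the search cheap, since the number of paths explored grows very quickly with depth.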

Once the graph is constructed, each node is passed to findLongChain, which runs its best chain through expandChain before returning it. Once every node has been run through findLongChain, the root node of the best chain is run through findLongChain again, but this time all of its best paths are run through expandChain to see if there is anything longer to find.

Because findLongChain is completely standalone, I was able to take code that was originally something like this:

    Chain longChain;
    for (int i = 0; i < todo.count(); ++i) {
        Chain chain = findLongChain(todo[i]);
        if (chain.count() > longChain.count())
            longChain = chain;
    }

and change it to use QtConcurrent's mappedReduced function to take full advantage of my dual- and quad-core computers. The ease with which QtConcurrent lets you scale your solution is fantastic.

    Chain longChain = QtConcurrent::mappedReduced(todo, findLongChain, reduce);
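For readers without Qt, the same map-then-reduce shape can be approximated with the C++ standard library. The sketch below deliberately uses std::async instead of QtConcurrent, so it is an analogue of the idea and not what my code does. The name longestChainParallel and the workers parameter are made up, and it assumes the finder function is safe to call from several threads at once.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

using Chain = std::vector<int>;

// Map: run findLongChain on every entry of todo, split across `workers` tasks.
// Reduce: keep the longest chain seen.
template <typename Finder>
Chain longestChainParallel(const std::vector<int>& todo, Finder findLongChain,
                           unsigned workers) {
    assert(workers > 0);

    std::vector<std::future<Chain>> partial;
    for (unsigned w = 0; w < workers; ++w) {
        partial.push_back(std::async(std::launch::async,
            [&todo, &findLongChain, w, workers] {
                Chain best;
                // Worker w handles entries w, w + workers, w + 2 * workers, ...
                for (std::size_t i = w; i < todo.size(); i += workers) {
                    Chain chain = findLongChain(todo[i]);
                    if (chain.size() > best.size()) best = std::move(chain);
                }
                return best;
            }));
    }

    // Combine the per-worker winners on the calling thread.
    Chain longChain;
    for (auto& result : partial) {
        Chain chain = result.get();
        if (chain.size() > longChain.size()) longChain = std::move(chain);
    }
    return longChain;
}
```

This only works because each call to the finder is independent of the others, which is the same property that made the QtConcurrent version a small change. When two chains tie on length, the winner may differ from the one the sequential loop would pick.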

While hacking on my solution I came across the website of Jeremy Faller, who had also been looking at the problem and was able to produce good results. (Our approaches were completely different, and we didn't see each other's code until after we had finished.) I took his longest chain, of length 318, and ran it through my expandChain function to get a chain of length 326: a very respectable result found by the combination of our two programs.

The source code for my solution can be found on github in my sling blade runner repository.

After compiling, run './sbr'. The first run takes around ten minutes on a quad core (subsequent runs last around five minutes). It will print out the best solution that was found, which, unless you change the code, has a length of 312. If you uncomment the DEBUG define at the top of main.cpp, the progress through the nodes will be printed, as well as the best chain as it is found.

Hacking on this problem has been a lot of fun. Because there is no right solution I was free to try out all sorts of different approaches and spend time reading papers on graphs (the ones I didn't have to pay for). I also spent some time messing around with QtConcurrent to see how fast I could get it to run. Working on the problem was fun and motivating as I passed everyone else's longest chains (the ones I found via Google). The best score found by my code is 313, but I am happier about the combined length of 326.
