Parallelizing a Game AI Engine: Root-Level Optimization

Sunday, 22 February 2026  |  Pavan Kumar S G

As part of my Season of KDE work on the Mankala game engine, I am implementing root-level parallelization to speed up the AI's move evaluation.

Here's how I achieved about 2x speedup.

The Problem: Sequential Move Evaluation

The Mankala AI uses the minimax algorithm to evaluate moves, searching 7 levels deep into the game tree. At each level, the tree branches into 6 possible moves, creating ~280,000 positions to evaluate per search. On my Intel i5-11260H system, a single move evaluation took 0.003-0.007 seconds sequentially.

While this seems fast, it adds up during gameplay and prevents deeper searches that would make the AI stronger. The solution? Parallelize the move evaluation.

The Approach: Root-Level Parallelism

Root-level parallelism is elegantly simple: evaluate each top-level move on a separate thread. Instead of evaluating moves 0, 1, 2, 3, 4, 5 sequentially, we evaluate them in parallel:

Sequential:                  Parallel (8 threads):
Move 0 → [search tree]      Thread 0: Move 0 → [search tree]
Move 1 → [search tree]      Thread 1: Move 1 → [search tree]
Move 2 → [search tree]      Thread 2: Move 2 → [search tree]
Move 3 → [search tree]      Thread 3: Move 3 → [search tree]
Move 4 → [search tree]      Thread 4: Move 4 → [search tree]
Move 5 → [search tree]      Thread 5: Move 5 → [search tree]

This works because each move creates a completely independent search tree—no shared state, no synchronization needed during the search itself.

Implementation: Five Threading Models

Rather than just implementing one approach, I decided to compare five different threading models to see which performed best and which was easiest to maintain.

1. OpenMP: The Simplest Solution

OpenMP turned out to be the winner for simplicity. A single pragma directive parallelizes the entire loop:

int parallelRootMiniMax(Player player, const Rules& rules,
                        const Board& state, int num_threads) {
    std::vector<int> moves = rules.getMoves(player, state);

    int best_move = -1;
    int best_score = N_INFINITY;
    // next_player is the side to move after our root move (the opponent of `player`);
    // its computation is omitted here for brevity.

    #pragma omp parallel for schedule(dynamic) num_threads(num_threads)
    for (size_t i = 0; i < moves.size(); ++i) {
        Board new_state = state;
        rules.move(moves[i], player, new_state);

        Table table;  // thread-local transposition table: no locking needed
        int score = alphaBeta(next_player, rules, new_state,
                              DEPTH - 1, N_INFINITY, P_INFINITY, table).eval;

        // Only the best-move bookkeeping is shared state, so only it
        // needs a critical section.
        #pragma omp critical
        {
            if (score > best_score) {
                best_score = score;
                best_move = moves[i];
            }
        }
    }

    return best_move;
}

The schedule(dynamic) clause is crucial—it ensures threads grab moves as they finish, automatically balancing the load when some moves take longer to evaluate than others.

2. pthreads: Manual Control

For comparison, I implemented the same logic using POSIX threads. This required significantly more code—about 60 additional lines for thread creation, data passing, and result collection. While it gave fine-grained control, the performance was nearly identical to OpenMP.
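For a flavor of that extra plumbing, here is a stripped-down sketch of the pthread version. `MoveJob`, `worker`, and `evaluateMove` are illustrative names with a dummy scoring function standing in for the real search; this is not the engine's actual code:

```cpp
#include <pthread.h>
#include <vector>

// Hypothetical per-thread job: which move to evaluate and where to put the score.
struct MoveJob {
    int move;
    int score;
};

// Placeholder for the real alpha-beta search; returns a dummy score.
static int evaluateMove(int move) { return move * 10; }

static void* worker(void* arg) {
    MoveJob* job = static_cast<MoveJob*>(arg);
    job->score = evaluateMove(job->move);
    return nullptr;
}

int bestMovePthreads(const std::vector<int>& moves) {
    std::vector<pthread_t> threads(moves.size());
    std::vector<MoveJob> jobs(moves.size());

    // One thread per root move: pack the arguments, spawn, then join.
    for (size_t i = 0; i < moves.size(); ++i) {
        jobs[i] = {moves[i], 0};
        pthread_create(&threads[i], nullptr, worker, &jobs[i]);
    }

    int best_move = -1, best_score = -1;
    for (size_t i = 0; i < moves.size(); ++i) {
        pthread_join(threads[i], nullptr);
        if (jobs[i].score > best_score) {
            best_score = jobs[i].score;
            best_move = jobs[i].move;
        }
    }
    return best_move;
}
```

Even in this reduced form, the manual argument packing and join loop show where the extra lines come from.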

3. std::thread: Modern C++

The C++11 std::thread approach provided a cleaner interface than pthreads while maintaining similar performance. I used an atomic counter for dynamic work distribution:

std::atomic<size_t> next_move_idx{0};

auto worker = [&]() {
    while (true) {
        size_t idx = next_move_idx.fetch_add(1);
        if (idx >= moves.size()) break;
        
        // Evaluate move[idx]...
    }
};
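Fleshed out into a self-contained sketch, with a stand-in scoring function in place of the real alpha-beta search, the full pattern looks like this:

```cpp
#include <atomic>
#include <limits>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for the real alpha-beta evaluation of one root move.
static int scoreMove(int move) { return 100 - move; }

int bestMoveStdThread(const std::vector<int>& moves, unsigned num_threads) {
    std::atomic<size_t> next_move_idx{0};
    std::mutex best_mutex;
    int best_move = -1;
    int best_score = std::numeric_limits<int>::min();

    // Each worker repeatedly claims the next unevaluated move until none remain,
    // giving the same dynamic load balancing as OpenMP's schedule(dynamic).
    auto worker = [&]() {
        while (true) {
            size_t idx = next_move_idx.fetch_add(1);
            if (idx >= moves.size()) break;
            int score = scoreMove(moves[idx]);

            std::lock_guard<std::mutex> lock(best_mutex);
            if (score > best_score) {
                best_score = score;
                best_move = moves[idx];
            }
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return best_move;
}
```

The atomic fetch_add makes work claiming lock-free; only the best-move update at the end of each evaluation takes a mutex.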

4. Taskflow: Task-Based Parallelism

Taskflow provides the highest-level abstraction of the five: the work is described as a task graph and its executor handles the scheduling. This implementation is still in progress.

5. Sequential: The Baseline

The sequential version served as our performance baseline, ensuring we could accurately measure speedup.

The Results: Performance Analysis

Testing on my 12-core Intel i5-11260H system inside the Fedora container:

Thread Scaling (OpenMP)

| Threads | Time (s) | Speedup | Efficiency |
|---------|----------|---------|------------|
| 1       | 0.007    | 0.42x   | 42.3%      |
| 2       | 0.004    | 0.74x   | 36.8%      |
| 4       | 0.003    | 1.19x   | 29.7%      |
| 8       | 0.002    | 2.20x   | 27.5%      |

Method Comparison (8 threads)

| Method      | Time (s) | Speedup |
|-------------|----------|---------|
| Sequential  | 0.004    | 1.00x   |
| OpenMP      | 0.002    | 2.20x   |
| pthreads    | 0.003    | 1.50x   |
| std::thread | 0.003    | 1.60x   |

Understanding the Numbers

The speedup of 2.20x with 8 threads might seem lower than the theoretical maximum of 8x, but there are good reasons:

  1. Very Fast Searches: At 0.003 seconds, thread overhead dominates
  2. Limited Parallelism: Only 6 moves to evaluate (not enough work for 8 threads)
  3. Amdahl's Law: Sequential portions (setup, result collection) limit speedup

Lessons Learned

1. Simplicity Wins

OpenMP's single pragma achieved the same performance as 60 lines of manual pthread code. Unless you need fine-grained control, use the simplest approach.

2. Dynamic Scheduling Matters

Move evaluation times vary significantly. Static work distribution led to load imbalance, while dynamic scheduling kept all threads busy.

3. Thread-Local Data is Key

Each thread needs its own transposition table. Sharing a table required locking, which killed performance.
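The fix is simply to construct the table inside the worker, so each thread owns its own copy. A minimal sketch of the pattern, using std::unordered_map as a stand-in for the engine's Table type and a fake evaluation function:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Stand-in for the engine's transposition table, keyed by position hash.
using Table = std::unordered_map<uint64_t, int>;

// A worker constructs its own Table, so lookups and stores need no locking.
// Returns how many positions were answered from the table (cache hits).
int searchWorker(const std::vector<uint64_t>& positions) {
    Table table;  // owned by this worker alone
    int hits = 0;
    for (uint64_t pos : positions) {
        auto it = table.find(pos);
        if (it != table.end()) {
            ++hits;  // position already evaluated by this thread
        } else {
            table[pos] = static_cast<int>(pos % 97);  // fake evaluation result
        }
    }
    return hits;
}
```

The trade-off is that threads cannot reuse each other's results, but for root-level parallelism the independent subtrees rarely overlap, so the lost sharing costs far less than the lock contention it avoids.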

4. Measure Everything

My initial assumption was that pthreads would be fastest due to lower overhead. OpenMP actually performed better due to its sophisticated runtime.

5. Context Matters

Root-level parallelism is perfect for Mankala's characteristics (6 moves, moderate depth). For games with different properties, other approaches might work better.

Future Optimizations

While root-level parallelism is sufficient for Mankala, there are advanced techniques that could push performance further:

  • Parallel Iterative Deepening: Search multiple depths simultaneously
  • Lazy SMP: Multiple threads with shared transposition table (used in Stockfish)

These techniques add significant complexity but can provide 5-10x additional speedups for more complex games.

Conclusion

Implementing root-level parallelism in the Mankala engine achieved a 2.2x speedup with minimal code complexity.

The key takeaways:

  • Root-level parallelism is simple and effective for game AI
  • OpenMP provides the best balance of simplicity and performance
  • Always measure—performance intuitions are often wrong

The parallelized engine is now faster and more responsive, making the Mankala game more enjoyable to play. And the best part? The code remains simple and maintainable.