Exercise 02 - Genetic Algorithms

Intelligent Software Systems - summer term 2019

by André Karge

Goal of this exercise is to learn and understand genetic algorithms by finding an optimal configuration for highly configurable software systems

Highly Configurable Software Systems

Can be for example the Linux-Kernel or SQLite.

Linux-Kernel

  • has ca. 6,000,000 lines of code
  • is highly configurable
    • > 10,000 configurable options (e.g.: x86, 64bit, ...)
    • nearly all source code is optional

SQLite

  • is an embedded highly configurable database system
    • deployed on over 500 million systems
    • provides 88 compiler options for configurations
  • suppose we would measure each variant of SQLite:

    • 2^88 variants and 5 minutes per measurement (compile + benchmark)

      = 2^88 * 5min / 60 (per h) / 24 (per day) / 365 (per year)

      = 2,944,111,585,058,457,655,296 years

this is why we use a meta-heuristic, since we don't want to wait that long ;)

Problem set description for this exercise

Highly configurable software systems are systems which provide many features to be selected for creating an individual variant of a program.

However, selecting and deselecting specific features for a configuration can result in efficient or an inefficient configurations of the program.

For the exercise, we want to find an optimal configuration for the h264 and bdbc systems.

h264

h264 is a block-oriented motion-compensation-based video compression standard (source).

bdbc

Berkeley DB (bdb) is a software library intended to provide a high-performance embedded database for key/value data (source). The given dataset is based on the c-implementation of bdb.

Problem sets

  • Each problem set consists of 2 files: features & interactions

Feature file definition

  • each line represents a feature of the software system
  • the value at the end of the line represents the performance value of the respective feature if enabled
  • for example "have_crypto: 0.0120768614910988" means that a version of the program can have a cryptographic function enabled and the feature influences the performance by 0.0120768614910988

Interaction file definition

  • each line represents a specific feature combination
  • a line consists of at least 2 features separated by a "#" and a performance value for this specific selection
  • for example: "CS32MB#DIAGNOSTIC: 21.0214021572657" means if CS32MB and DIAGNOSTIC are enabled, this affects the configuration performance by 21.0214021572657
In [46]:
bdbc_f = "code/datasets/bdbc_feature.txt"
bdbc_i = "code/datasets/bdbc_interactions.txt"
h264_f = "code/datasets/h264_feature.txt"
h264_i = "code/datasets/h264_interactions.txt"

path = bdbc_f
#path = bdbc_i
#path = h264_f
#path = h264_i

with open(path) as f:
    content = f.read()
print(content)
root: 4.29699011265453
HAVE_CRYPTO: 0.0120768614910988
HAVE_HASH: 4.62346632789263
HAVE_REPLICATION: 0.274902720516604
HAVE_VERIFY: 0.0534583494140551
HAVE_SEQUENCE: 0.468274246157709
HAVE_STATISTICS: 0.422176987829888
DIAGNOSTIC: 0.00352722010086667
PS1K: 0.789118194698634
PS4K: 0.505426563786175
PS8K: 1.69334751326048
PS16K: 4.45418742564813
PS32K: 0.746689251767928
CS32MB: 3.86431506487224
CS16MB: 2.95905633060594
CS64MB: 2.58412079773263
CS512MB: 5.34132772539305

Aim of this exercise: find an optimal configuration (= high performance | max performance value)

Create a fitness function

  • Build a fitness assessment function using the given files

feature file:

root: 4.29699011265453

HAVE_CRYPTO: 0.0120768614910988

HAVE_HASH: 4.62346632789263

HAVE_REPLICATION: 0.274902720516604

...


interaction file:

PS16K#CS512MB: -4.05239323580753

PS1K#HAVE_REPLICATION: -0.732537201572133

HAVE_CRYPTO#PS8K: -0.833090135903526

HAVE_REPLICATION#PS16K#DIAGNOSTIC: 31.7326205862695

...


fitness function:

4.29 + HAVE_CRYPTO 0.01 + HAVE_HASH 4.63 + HAVE_REPLICATION * 0.27 + ... +

PS16K CS512MB (-4.05) + PS1K HAVE_REPLICATION (- .73) + ...

  • write your genetic algorithm by using the framework given in file run_genetic_alg.py

How to build a genetic algorithm? See lecture nodes for that.

How to model an individual candidate solution?

  • Use a feature vector with the length of all possible features
  • represent the selection of a feature as either 0 (deselected) or 1 (selected)

For example when selecting only the feature "root" in bdbc:

[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

The fitness of this candidate solution is: 4.29699011265453


Compare your Algorithm with these values (my best performing solutions):

  • fitness(bdbc) $ \rightarrow $ 154.96
  • fitness(h264) $ \rightarrow $ 3918.64