# Exercise 02 - Genetic Algorithms¶

### Intelligent Software Systems - summer term 2019¶

#### by André Karge¶

Goal of this exercise is to learn and understand genetic algorithms by finding an optimal configuration for highly configurable software systems

## Highly Configurable Software Systems¶

Can be for example the Linux-Kernel or SQLite.

### Linux-Kernel¶

• has ca. 6,000,000 lines of code
• is highly configurable
• > 10,000 configurable options (e.g.: x86, 64bit, ...)
• nearly all source code is optional

### SQLite¶

• is an embedded highly configurable database system
• deployed on over 500 million systems
• provides 88 compiler options for configurations
• suppose we would measure each variant of SQLite:

• 2^88 variants and 5 minutes per measurement (compile + benchmark)

= 2^88 * 5min / 60 (per h) / 24 (per day) / 365 (per year)

= 2,944,111,585,058,457,655,296 years

this is why we use a meta-heuristic, since we don't want to wait that long ;)

## Problem set description for this exercise¶

Highly configurable software systems are systems which provide many features to be selected for creating an individual variant of a program.

However, selecting and deselecting specific features for a configuration can result in efficient or an inefficient configurations of the program.

For the exercise, we want to find an optimal configuration for the h264 and bdbc systems.

#### h264¶

h264 is a block-oriented motion-compensation-based video compression standard (source).

#### bdbc¶

Berkeley DB (bdb) is a software library intended to provide a high-performance embedded database for key/value data (source). The given dataset is based on the c-implementation of bdb.

## Problem sets¶

• Each problem set consists of 2 files: features & interactions

### Feature file definition¶

• each line represents a feature of the software system
• the value at the end of the line represents the performance value of the respective feature if enabled
• for example "have_crypto: 0.0120768614910988" means that a version of the program can have a cryptographic function enabled and the feature influences the performance by 0.0120768614910988

### Interaction file definition¶

• each line represents a specific feature combination
• a line consists of at least 2 features separated by a "#" and a performance value for this specific selection
• for example: "CS32MB#DIAGNOSTIC: 21.0214021572657" means if CS32MB and DIAGNOSTIC are enabled, this affects the configuration performance by 21.0214021572657
In [46]:
bdbc_f = "code/datasets/bdbc_feature.txt"
bdbc_i = "code/datasets/bdbc_interactions.txt"
h264_f = "code/datasets/h264_feature.txt"
h264_i = "code/datasets/h264_interactions.txt"

path = bdbc_f
#path = bdbc_i
#path = h264_f
#path = h264_i

with open(path) as f:
print(content)

root: 4.29699011265453
HAVE_CRYPTO: 0.0120768614910988
HAVE_HASH: 4.62346632789263
HAVE_REPLICATION: 0.274902720516604
HAVE_VERIFY: 0.0534583494140551
HAVE_SEQUENCE: 0.468274246157709
HAVE_STATISTICS: 0.422176987829888
DIAGNOSTIC: 0.00352722010086667
PS1K: 0.789118194698634
PS4K: 0.505426563786175
PS8K: 1.69334751326048
PS16K: 4.45418742564813
PS32K: 0.746689251767928
CS32MB: 3.86431506487224
CS16MB: 2.95905633060594
CS64MB: 2.58412079773263
CS512MB: 5.34132772539305



### Create a fitness function¶

• Build a fitness assessment function using the given files

feature file:

root: 4.29699011265453

HAVE_CRYPTO: 0.0120768614910988

HAVE_HASH: 4.62346632789263

HAVE_REPLICATION: 0.274902720516604

...

interaction file:

PS16K#CS512MB: -4.05239323580753

PS1K#HAVE_REPLICATION: -0.732537201572133

HAVE_CRYPTO#PS8K: -0.833090135903526

HAVE_REPLICATION#PS16K#DIAGNOSTIC: 31.7326205862695

...

fitness function:

4.29 + HAVE_CRYPTO 0.01 + HAVE_HASH 4.63 + HAVE_REPLICATION * 0.27 + ... +

PS16K CS512MB (-4.05) + PS1K HAVE_REPLICATION (- .73) + ...

• write your genetic algorithm by using the framework given in file run_genetic_alg.py

How to build a genetic algorithm? See lecture nodes for that.

How to model an individual candidate solution?

• Use a feature vector with the length of all possible features
• represent the selection of a feature as either 0 (deselected) or 1 (selected)

For example when selecting only the feature "root" in bdbc:

[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

The fitness of this candidate solution is: 4.29699011265453

Compare your Algorithm with these values (my best performing solutions):

• fitness(bdbc) $\rightarrow$ 154.96
• fitness(h264) $\rightarrow$ 3918.64