MEPX software

version 2017.10.15.0-beta

How to install Multi Expression Programming X

Windows 64bit

Just download the program from here:

mepx_win64.zip (2.3 MB)

Unzip the archive and run the mepx.exe application. There is no installation kit. Please remember where you saved it so that you can run it next time.

Apple MacOS/OSX (64bit) (version 10.9 or newer)

Download the program from here:

mepx_osx.zip (4.4 MB)

It is a .zip archive. Double-click it in Finder; it will be decompressed in the same folder as the zip archive. Open the application from there (right-click the icon and choose the Open command).

Ubuntu (64bit) (tested on Ubuntu 14 and 15)

Download the program from here:

mepx64.deb (1.6 MB)

Install the program with Ubuntu Software Center.

Open a Terminal and run the following command (this enables icons on buttons and menus, which are disabled by default):

gsettings set org.gnome.settings-daemon.plugins.xsettings overrides "{'Gtk/ButtonImages': <1>, 'Gtk/MenuImages': <1>}"

That is all.

mepx is the name of the program, so each time you want to start it, just open a Terminal and type mepx.

Test projects (taken from PROBEN1 and other datasets) can be downloaded from: MEPX test projects on Github. Just download an .xml file and press the Load project button in MEPX to load it.

User manual

Quick start

  • Select the Data panel.
  • Select the Training data panel.
  • Press the Load training data button and choose a csv or txt file. Data must be separated by blank space, tab or ;.
  • Select the Parameters panel. Modify some parameters if needed. For instance, one could modify the code length, the number of subpopulations, the (sub)population size, the number of generations, etc. Also specify the problem type (regression or binary classification).
  • Press the Start button from the main toolbar.
  • Read the results from the Results panel.
  • You can also save the entire project (data, parameters, results) by pressing the Save project button from the main toolbar.

Data

Data are loaded from csv or txt files. Data must be separated by blank space, tab or ;.

The last value on each line is the target (expected output). Test data may be given without the output (it may have one column fewer than the training data).
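
For example, a hypothetical training file for a regression problem with two inputs and one target (the last value on each line) could look like this:

1.5 2.0 3.5
0.3 4.1 4.4
2.0 0.7 2.7

A corresponding test file could contain only the two input columns.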

Currently the problems can have only 1 output (see below an exception for classification problems). Files containing multiple outputs must be split accordingly (for instance, the Building problem from PROBEN1, which has 3 outputs: energy, hot and cold water).

For classification problems, the last column may contain only the values 0 or 1 (for binary classification) or the values 0, 1, ..., (num_classes - 1) for multi-class classification.

The output for classification problems may also be given in one-of-m format. For instance, if the problem has 5 classes, the output will have 5 values, one of them being set to 1 and all the others being set to 0. This type of format is loaded from files with the .dt extension.
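
For example, for a hypothetical problem with two inputs and 5 classes, a line describing an example of class 2 (counting from 0) in one-of-m format could end with the 5 values encoding the class:

0.7 1.2 0 0 1 0 0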

Training data is compulsory. The other sets (validation and test) are optional.

You can also load alphanumerical values and then convert them to numerical values. You have several specialised buttons for that:

  • Replace values - replaces some values (for instance, alphanumerical ones with numerical ones). Find and replace works with regular expressions too.
  • To numeric - performs an automatic conversion of alphanumerical values to integer values. The first alphanumerical value will be converted to 0, the second (distinct) one to 1, and so on.

The user can also scale numerical values to a given interval.
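
As an illustration of scaling (a sketch using the usual min-max formula; this is not MEPX's actual source code and the button's exact behaviour may differ), mapping a column of values to a chosen interval can be done like this:

#include <stdio.h>

/* rescale n values to the interval [new_min, new_max] using min-max scaling */
void scale_to_interval(double *values, int n, double new_min, double new_max)
{
    double old_min = values[0], old_max = values[0];
    for (int i = 1; i < n; i++) {
        if (values[i] < old_min) old_min = values[i];
        if (values[i] > old_max) old_max = values[i];
    }
    if (old_max == old_min)
        return; /* all values are equal; nothing to scale */
    for (int i = 0; i < n; i++)
        values[i] = (values[i] - old_min) / (old_max - old_min) * (new_max - new_min) + new_min;
}

int main(void)
{
    double column[] = {10, 20, 15, 30};
    scale_to_interval(column, 4, 0, 1); /* scale to [0, 1] */
    for (int i = 0; i < 4; i++)
        printf("%g ", column[i]);
    printf("\n");
    return 0;
}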

Parameters

Fitness function

Fitness (or the error) is computed as follows:

For symbolic regression problems, the fitness is either the Mean Absolute Error (the sum of absolute errors divided by the number of examples) or the Mean Squared Error (the sum of squared errors divided by the number of examples).

For classification problems the fitness is computed in multiple ways, depending on the problem type or strategy. However, what we report in the results tables is the percentage of incorrectly classified data (the number of incorrectly classified examples divided by the number of examples and multiplied by 100).
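
For illustration only (this is not MEPX's source code), the two regression errors and the reported classification error could be computed in C as follows:

#include <math.h>

/* Mean Absolute Error: sum of absolute errors divided by the number of examples */
double mean_absolute_error(const double *target, const double *output, int n)
{
    double sum = 0;
    for (int i = 0; i < n; i++)
        sum += fabs(output[i] - target[i]);
    return sum / n;
}

/* Mean Squared Error: sum of squared errors divided by the number of examples */
double mean_squared_error(const double *target, const double *output, int n)
{
    double sum = 0;
    for (int i = 0; i < n; i++)
        sum += (output[i] - target[i]) * (output[i] - target[i]);
    return sum / n;
}

/* percentage of incorrectly classified examples, as reported in the results tables */
double classification_error_percent(const int *target, const int *predicted, int n)
{
    int wrong = 0;
    for (int i = 0; i < n; i++)
        if (predicted[i] != target[i])
            wrong++;
    return 100.0 * wrong / n;
}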

Problem type

Can be:

  • regression,
  • binary classification (with 2 classes),
  • multi-class classification (with 2 or more classes).

A problem with 2 classes can be solved by selecting either binary classification or multi-class classification.

Binary classification uses a threshold for making the distinction between classes. Values less than or equal to the threshold are classified as belonging to class 0 and the others are classified as belonging to class 1.

In the case of binary classification, the threshold is computed automatically (because of that, binary classification can be slower sometimes).
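
As a minimal sketch of this thresholding rule (not MEPX's actual source code):

/* values less than or equal to the threshold belong to class 0, the others to class 1 */
int classify_binary(double output, double threshold)
{
    return (output <= threshold) ? 0 : 1;
}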

For multi-class classification there are 3 strategies:

  • Winner takes all - fixed positions - the outputs are assigned to groups of genes and the gene encoding the expression having the first maximal value provides the class for that data (see more details here: google groups post); a C sketch of this rule is given after this list,
  • Winner takes all - fixed positions - smooth fitness,
  • Winner takes all - best genes.
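
As a rough illustration of the first strategy (a sketch only, not MEPX's actual source code), assigning a data row to the class whose output has the first maximal value could look like this:

/* "winner takes all - fixed positions": each class has one output (one group of genes);
   the class whose output holds the first maximal value wins */
int winner_takes_all(const double *class_outputs, int num_classes)
{
    int winner = 0;
    for (int c = 1; c < num_classes; c++)
        if (class_outputs[c] > class_outputs[winner]) /* strict >, so the first maximum is kept */
            winner = c;
    return winner;
}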

If use validation set is checked, then at each generation the best individual is run against the validation set, and the best such individual (among those tested against the validation set) is the output of the program (and will be applied on the test data).

It is possible to run the optimization on a smaller subset of the training data. In that case, set the Random subset size to a value smaller than the size of the training set. The subset is replaced after the number of generations given by Num generations for which random subset is kept fixed.

Operators (or functions)

Classic operators +, -, *, ... nothing new here.

Note that trigonometric operators work with radians.

The algorithm

MEPX uses a steady-state model with multiple subpopulations. Steady-state means that, inside one subpopulation, the worst individuals are replaced with newer ones (if the newer ones are better).

The user may specify the number of subpopulations. Each subpopulation runs independently from the others and, after each generation, they exchange a few individuals.

Genetic operators (crossover and mutation) are classic ... nothing new here.

It is possible to specify how often variables, operators and constants should appear in a chromosome. This is done probabilistically. If you want more operators to appear, increase the operators probability. More operators means more complex expressions.

The sum of the operators probability, the variables probability and the constants probability must be 1.
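
For illustration only (not MEPX's actual source code), deciding what a gene should hold, given the three probabilities, could look like this:

#include <stdlib.h>

enum gene_kind { OPERATOR, VARIABLE, CONSTANT };

/* p_operators + p_variables + p_constants = 1, so the constants case is the remainder */
enum gene_kind choose_gene_kind(double p_operators, double p_variables)
{
    double r = rand() / (double)RAND_MAX; /* uniform value in [0, 1] */
    if (r < p_operators)
        return OPERATOR;
    if (r < p_operators + p_variables)
        return VARIABLE;
    return CONSTANT;
}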

Constants

In order to enable constants, one must define a probability greater than 0 for constants. You cannot edit that probability directly; instead, constants_probability + operators_probability + variables_probability = 1. So if you define values for the operators and variables probabilities such that their sum is less than 1, the constants probability will be greater than 0.

Constants can be user-defined or generated by the program (over a given interval). Generated constants can be kept fixed for the entire evolution or they can also evolve. Mutation of a constant is done by adding a random value from the interval [-max_delta, +max_delta].
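
As a minimal sketch of this mutation rule (not MEPX's actual source code):

#include <stdlib.h>

/* mutate a constant by adding a uniform random value from [-max_delta, +max_delta] */
double mutate_constant(double constant, double max_delta)
{
    double delta = ((double)rand() / RAND_MAX) * 2.0 * max_delta - max_delta;
    return constant + delta;
}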

Runs

Usually multiple runs must be performed for computing some statistics. It is also possible to specify the initial seed of the first run (consecutive runs will start from the previous seed + 1).

Num threads - runs the subpopulations on multiple CPU cores. This can increase the speed of the analysis significantly. If you have a quad-core processor with hyper-threading, you may set the number of threads to 8. For best results make sure that the number of subpopulations is a multiple of the number of threads.

Results

The following results are displayed:

  • the error for the entire training, validation and test sets. In the case of classification problems, we display the percentage of incorrectly classified data.
  • the obtained value for each item in the training, validation and test sets (also called Model or Output).
  • the evolution of the fitness (which can be different from the number of incorrectly classified data) for the best individual in the population and for the population average.
  • the C source code of the best solution. This code can be simplified in order to show only the instructions that generate the output (remember that not all genes of a chromosome participate in the solution - such genes are called introns). Note that there is no simplification in the case of multi-class classification.

Reporting problems, bugs, comments

If you have problems with this program, please save the project (by pressing the Save project button from the main toolbar) and send it to mihai.oltean@gmail.com.