Tutorial

This tutorial will show you how to:

  • create a DSL from scratch;

  • add a semantic to the DSL.

Everything afterward is PBE specific: making your DSL usable by scripts, creating datasets, training a model and using it to solve synthesis tasks.

The running example is the calculator DSL, whose source code can be found in the folder ./examples/pbe/calculator.

Create a DSL from scratch

A DSL is a syntactic object: it only defines the syntax of our primitives. The relevant file is calculator/calculator.py.

A primitive is a function or a constant that may be needed in a solution program. It is typed and usually has a semantic, but the semantic is not defined in the DSL.

The syntax is a mapping from primitives to their type.

For detailed information about types see the page on the type system.

The syntax object is a dictionary whose keys are unique strings identifying your primitives and whose values are ProgSynth types. Explaining all the type features supported by ProgSynth would take a while; fortunately, ProgSynth provides the auto_type function, which dramatically speeds up writing a syntax. Here is an example:

from synth.syntax import auto_type, DSL

syntax = auto_type({
  "1": "int",
  "2": "int",
  "3": "int",
  "3.0": "float"
  "int2float": "int -> float",
  "+": "'a [int | float] -> 'a [int | float] -> 'a [int | float]",
  "-": "'a [int | float] -> 'a [int | float] -> 'a [int | float]",
})

dsl = DSL(syntax)

The notation might seem complex, so let us briefly explain what happens. A plain string such as "int" or "float" is transformed into a ground type, and "->" translates into a function type. The "'" prefix indicates that "'a" is a polymorphic type; the [int | float] right after it restricts this polymorphic type to the listed values: int or float. This means we get one + for int and one + for float, but both share the same semantic, since they are both named "+". For detailed information about types and on how this works, see the page on the type system.
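Concretely, after instantiation (done later in this tutorial via dsl.instantiate_polymorphic_types()), the polymorphic + expands into one concrete primitive per allowed value of 'a:

dsl.instantiate_polymorphic_types()
# The DSL now contains two concrete versions of "+":
#   + : int -> int -> int
#   + : float -> float -> float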

You can now use your DSL to generate grammars!
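For instance, here is a minimal sketch of building a grammar from the DSL; it assumes the CFG.depth_constraint constructor, which builds the grammar of all programs of a given type request up to a maximum depth:

from synth.syntax import CFG, auto_type

# A sketch (CFG.depth_constraint is an assumption): the grammar of all
# programs of type int -> int with depth at most 4
cfg = CFG.depth_constraint(dsl, auto_type("int -> int"), 4)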

You might want to add syntactic constraints on the generated grammars; this is covered in sharpening.

Add a semantic to the DSL

The relevant file is calculator/calculator.py. It’s great that we can produce grammars and everything with our DSL, but we cannot execute our programs! It is time to give them a semantic! The semantic object is a dictionary where keys are unique strings identifying your primitives and values are unary functions or constants.

Here is the semantic for the primitives we defined earlier:

from synth.semantic import DSLEvaluator

semantic = {
    "+": lambda a: lambda b: round(a + b, 1),
    "-": lambda a: lambda b: round(a - b, 1),
    "int2float": lambda a: float(a),
    "1": 1,
    "2": 2,
    "3": 3,
    "3.0": 3.0,
}

evaluator = DSLEvaluator(semantic)

Constants are simply associated with their value. For functions, notice that while + is a binary function, here we provide a unary function that returns another unary function: ProgSynth needs functions in curried (unary) form in order to be able to do partial applications. Automatically transforming an n-ary Python function into a unary one currently induces a relatively high execution cost, which makes it prohibitive for ProgSynth.
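Currying makes partial application a plain function call; for instance, with the semantic above:

# Partially applying "+" to 1 yields a unary function
add_one = semantic["+"](1)
print(add_one(2.5))  # prints 3.5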

You can now use your evaluator to run your programs; the syntax is evaluator.eval(program, inputs_as_a_list).
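Here is a minimal sketch, assuming the dsl and evaluator defined above; parse_program is an assumed parsing helper (parsing is discussed further in the dataset section below):

from synth.syntax import auto_type

# "+" is polymorphic, so instantiate polymorphic types before parsing
dsl.instantiate_polymorphic_types()
# Parse f(x) = x + 1 at type request int -> int (parse_program is assumed)
program = dsl.parse_program("(+ 1 var0)", auto_type("int -> int"))
print(evaluator.eval(program, [2]))  # expected output: 3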

As a side note, exceptions may occur during evaluation and you may not want them to interrupt the Python process. In that case you can use evaluator.skip_exceptions.add(My_Exception): when such an exception occurs, it is caught and None is returned instead.
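For instance, to ignore overflow errors (OverflowError is only an illustration here):

# Programs raising OverflowError now evaluate to None instead of crashing
evaluator.skip_exceptions.add(OverflowError)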

The evaluator caches the evaluations of programs, so a value is computed only once for a given input. In some cases, however, you might need to clear the cache, since it can take up a lot of space; this can be done with evaluator.clear_cache().


Everything from this point on is PBE specific.


Making your DSL usable by scripts

Most if not all scripts in the pbe folder should work with little to no change for most DSLs. These scripts rely on the dsl_loader.py file, which manages DSLs and provides a streamlined way for all scripts to load and use them. You should add your DSL to that file to be able to use all of these scripts for free.

But since this is PBE specific, we first need to define a lexicon in calculator/calculator.py.

Lexicon

In the PBE specification, a lexicon is needed in order to:

  • create synthetic tasks and thus synthetic datasets;

  • use neural networks for prediction.

A lexicon is a list of all base values that can be encountered in the DSL. Here, we limit our DSL to float numbers rounded to one decimal, in the range [-256.0, 257.0). Note that only base types (PrimitiveType) contribute to the lexicon: for example, if our DSL were to manipulate lists of int or float, we would not have to add anything, since lists are not a base type.
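A lexicon matching this description could be written as follows (a sketch; the actual definition lives in calculator/calculator.py):

# All floats with one decimal in the range [-256.0, 257.0)
lexicon = [round(x / 10, 1) for x in range(-2560, 2570)]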

Finally adding your DSL

Your only point of interest in this file is the __dsl_funcs dictionary, which should be surrounded by comments. Here is the entry that we added to the dictionary for our calculator DSL:

"calculator": __base_loader(
        "calculator.calculator",
        [
            "dsl",
            "evaluator",
            "lexicon",
            ("reproduce_calculator_dataset", "reproduce_dataset"),
        ],
    ),

This tells the loader that the DSL is defined in the file calculator/calculator.py. When the loader imports this file, it fetches the variables dsl, evaluator, lexicon and reproduce_calculator_dataset, and makes them available under the fields dsl, evaluator, lexicon and reproduce_dataset respectively; notice that the tuple notation allows renaming. The first three are mandatory, while the last one is optional, in the sense that you might not need to define a reproduce_dataset function.

Creating a dataset

The relevant file is calculator/convert_calculator.py.

To generate a synthetic dataset, we first need a base dataset. For this example, we created a short JSON file named dataset/calculator_dataset.json, where each task is built with the following fields:

  • program: contains the representation of the program; parsing is done automatically by the DSL object (dsl.parse), so you don’t need to parse it yourself. Here is a representation of a program that computes f(x, y) = x + (y - x) in our DSL: (+ var0 (- var1 var0));

  • examples: the expected inputs and outputs of the program (a sketch of an entry is shown below).
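A hypothetical entry might look like this; the field names inside examples are assumptions for illustration, the authoritative schema being whatever convert_calculator.py expects:

{
    "program": "(+ var0 (- var1 var0))",
    "examples": [
        { "inputs": [1, 2], "output": 2 },
        { "inputs": [5, 8], "output": 8 }
    ]
}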

Once the dataset is written, we need a file converting it to the ProgSynth format; this is done here in convert_calculator.py. An important point to note is that we need to instantiate the PolymorphicType, since our + and - depend on it; therefore, before parsing, we need to call dsl.instantiate_polymorphic_types().

If you want to adapt the code of calculator/convert_calculator for your own custom DSL, it should work almost out of the box with ProgSynth. Note that ProgSynth needs to guess your type request, which it does from your examples. If you manipulate types that ProgSynth cannot guess, it will silently fill them with UnknownType; in that case, you may need to add your own function to guess the type request, or modify the one from ProgSynth at synth/syntax/type_helper.py@guess_type.

We can run this file from the command line, from the folder ./examples/pbe/calculator:

python convert_calculator.py dataset/calculator_dataset.json -o calculator.pickle

Explore a dataset

You might want to check that you correctly translated your tasks to the ProgSynth format. This can be done easily by visualizing the tasks of a dataset with the dataset explorer, dataset_explorer.py:

python examples/pbe/dataset_explorer.py --dsl calculator --dataset calculator.pickle

Creating a Task Generator

Most often you don’t need a custom TaskGenerator and the default one will work; however, if you have more than one ground type you will need one, and this is the case with the calculator DSL. The code is at the end of calculator/calculator.py.

TODO: explain in more detail

Generating a synthetic dataset

Now we can create synthetic datasets; an existing script does all of the work for us.

The dataset generator works out of the box for our DSL, but that may not always be the case; you can check out the other DSL folders and look at their task_generator_*.py files.

You can generate datasets using:

python examples/pbe/dataset_generator_unique.py --dsl calculator --dataset calculator/calculator.pickle -o dataset.pickle --inputs 1 --programs 1000

Train a model

For more information about model creation see this page.

You can easily train a model using:

python examples/pbe/model_trainer.py --dsl calculator --dataset my_train_dataset.pickle --seed 42 --b 32 -o my_model.pt -e 2

There are various options to configure the model and the training, which we do not delve into here.

Infer with a model

A model can be used to produce PCFGs. This will produce a pickle file in the same folder as your model; you will need to pass this file to the solver.

python examples/pbe/model_prediction.py --dsl calculator --dataset my_test_dataset.pickle --model my_model.pt --b 32 --support my_train_dataset.pickle

The --support my_train_dataset.pickle option is only used to filter the test set, keeping only type requests that were also present in the train set.

Evaluate a model

You might want to evaluate a model to see if it learned anything relevant. This can easily be done but is time consuming: to evaluate a model, we actually try to solve program synthesis tasks in the DSL, so this is not simply an inference task. If you are directly interested in synthesizing your first program, jump to the next section, which tells you exactly how to do that. You can evaluate a model using:

python examples/pbe/solve.py --dsl calculator --dataset my_test_dataset.pickle --pcfg pcfgs_my_test_dataset_my_model.pt -o . -t 60 --support my_train_dataset.pickle --solver cutoff

The most important parameter is perhaps -t 60, which gives a timeout of 60 seconds per task. You can also play with different solvers; by default, cutoff works pretty well on almost anything.

This will produce a CSV file in the output folder (. above). This result file can then be plotted using:

python examples/plot_solve_results.py --dataset my_test_dataset.pickle --folder . --support my_train_dataset.pickle

Again there’s a plethora of options available, so feel free to play with them.

Simple synthesis

Here is a simple function that takes your task, the PCFG and the evaluator, and generates a synthesized program. For more information about predictions and how to produce a P(U)CFG from a model, see this page. If you are more interested in solving, you should probably look at the files in synth.pbe.solvers, which offer different ways of solving our synthesis problem.

from synth import Task, PBE
from synth.semantic import DSLEvaluator
from synth.syntax import bps_enumerate_prob_grammar, ProbDetGrammar
from synth.pbe import CutoffPBESolver

def synthesis(
    evaluator: DSLEvaluator,
    task: Task[PBE],
    pcfg: ProbDetGrammar,
    task_timeout: float = 60,
):
    solver = CutoffPBESolver(evaluator)
    # Enumerate programs from the probabilistic grammar; the solver yields
    # the first program that satisfies all of the task's examples.
    solution_generator = solver.solve(task, bps_enumerate_prob_grammar(pcfg), task_timeout)
    try:
        solution = next(solution_generator)
        print("Solution:", solution)
    except StopIteration:
        # The generator was exhausted without yielding a solution
        print("No solution found under timeout")
    # Print the statistics recorded by the solver
    for stats in solver.available_stats():
        print(f"\t{stats}: {solver.get_stats(stats)}")