Sunday, May 31, 2015

Algebraic Data Types

So about a month ago, I posted about the addition of product types to Tangent. These allow you to combine data together into things that look like classes. Well, today is the other side of that coin: sum types. These let your data be either one type or another. Together, product types and sum types provide what are known as Algebraic Data Types, which functional languages have long touted for their expressiveness. Types can then be defined like "Foo is a string and a datetime or a string and a timespan".

The example test code for today:

int list :> int | (a: int),(b: int list) {
  (this).head => int { a }
  (this).tail => int list { b }
  print (this) => void {
    print this.head;
    print this.tail;
  }
}

print (il: int list) => void {}

entrypoint => void {
  print 1,2,3;
}

This defines a nice simple (non-empty) list of integers and functions to print the list. The pipe ('|') is used as the type constructor for sum types in type declarations. So the declaration above is read as "int list is either an int or an int and an int list". The recursive type then cascades out indefinitely. You'll find this sort of thing all over Lisp and other functional languages. Here, I've defined the comma as the list constructor. And because of how Tangent works, list constructs are directly usable in code (as in the entrypoint). No boatloads of parens necessary like in Lisp. No need to extend the syntax like C# and other languages with special list initialization syntax. You can do it right in-language with Tangent.
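For readers more at home in other languages, the same shape can be sketched in Python. This is only an analogy under assumptions: the names Cons, IntList and show are made up for illustration, and the choice between the two cases happens at run time with isinstance, where Tangent dispatches to the more specific function itself.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Cons:
    """Stand-in for the anonymous product type: an int plus the rest of the list."""
    head: int
    tail: "IntList"


# The sum type: an int list is either a lone int or a Cons cell.
IntList = Union[int, Cons]


def show(il: IntList) -> list:
    """Collect the elements in order, with one branch per alternative."""
    if isinstance(il, Cons):
        return [il.head] + show(il.tail)
    return [il]


# Rough equivalent of `print 1,2,3;` with the comma nesting to the right.
print(show(Cons(1, Cons(2, 3))))
```

Note that the last element is a bare int rather than an empty-list marker, which is what makes the list non-empty by construction.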

Alas, there's still some cruft around things. First is the need for the empty print function for int lists. Right now there's no real code for working with sum types. The only thing that is there is specialization logic like there is for single value types. So the "base" function is needed for the code to compile against, but the compiler is smart enough to dispatch calls to more specific versions of the function (int and the anonymous product type in this case). Not great.

Also not great are the passthroughs for head and tail on the anonymous product type. I'd like to have some syntax that would automatically re-expose constructor parameters, but I'm not sure what it is, or what it would look like.

And lastly, this is a pretty specific implementation. Lists should be generic. I have a good idea about what the syntax will look like for that, and I expect that stuff to be the next large chunk of code worked on.

Why are these things here, you ask? Sure, this sort of code seems pretty academic. After all, I don't expect people to use these lists instead of .NET's List. The main motivation for these things is so I don't have null in the language. The pains of null are well documented, and I'm curious how things would work in a world where you need explicit nullable types. Beyond that, I hope that with sum types I don't need to implement exceptions. If your function can return something or an error, use a sum type that says that explicitly. That's a lot like Java's checked exceptions, which suck hard, but I'm curious if the language flexibility can't lessen that pain in practice.
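To make the "something or an error" idea concrete, here is a minimal Python sketch of a function whose return type is a sum of a value and an error. Everything in it (ParseError, ParseResult, parse_int, describe) is hypothetical and not part of Tangent; it only illustrates the pattern of returning the failure instead of throwing it or handing back null.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ParseError:
    """The failure alternative: carries a message instead of raising."""
    message: str


# The return type says out loud that the call can fail.
ParseResult = Union[int, ParseError]


def parse_int(text: str) -> ParseResult:
    """Return the parsed int, or a ParseError value when text is not a number."""
    stripped = text.strip()
    digits = stripped[1:] if stripped.startswith("-") else stripped
    if digits.isascii() and digits.isdigit():
        return int(stripped)
    return ParseError(f"not an integer: {text!r}")


def describe(result: ParseResult) -> str:
    """Callers handle both alternatives; there is no None and no exception."""
    if isinstance(result, ParseError):
        return "error: " + result.message
    return "ok: " + str(result)


print(describe(parse_int("42")))
print(describe(parse_int("forty-two")))
```

The cost is the same one checked exceptions have: every caller has to say what happens in the error case, which is exactly the pain the language would need to soften.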

Anyways, these are now available in source control.

One bug I ran into was the lack of type checking in CIL. It happily sent the wrong parameter into a function and the function happily ran it, leading to badness. Similarly, if you don't unbox value types, they too will happily run the function on the wrong bits of data. Something to remember in the future since it was troublesome to track down. I also imagine that I could use such behavior in hacky terrible ways should they be necessary.

Also, sum types were a pain to implement. Product types are fairly simple, since the "does the type of that match what we need" logic can work the same as it did with simple types. Sum types required fairly distributed changes to a lot of different code since they add not only a selection step to type matching, but a layer of hierarchy to the type structure (though that was maybe since I insisted on anonymous types like above). It was interesting to see, and may be why so few languages implement them.

Saturday, May 9, 2015

Order of Operations

So I've finally gotten this iteration of Tangent to the point where I can start to see if it actually does what I think it should do. As you might expect, it does not.

I've gotten basic classes working enough that I can start making actual arithmetic with type shenanigans to coerce things to take the right order of operations. Here is the test code for the day:

term :> term (x: int){
  (this) => int { x; }
}

factor :> factor (x: int){
  (this) => term { term x; }
}

(lhs: int) + (rhs: term) => int { asm add lhs rhs; }
(lhs: term) * (rhs: factor) => term { term asm mul lhs rhs; }
(x: int) => factor { factor x; }

entrypoint => void {
  print 2 + 3 * 4;
}

Now for a slight aside. The first time I ran this, I got 86. How did I get 86? Well, read on! The answer is at the bottom.

Anyways, this now produces 20, due to a subtlety in parsers that I didn't really fully grok until today when it bit me full on the ass. When you look at a formal grammar for arithmetic, there are always conversions to take function invocations, constants, paren expressions... bunches of stuff into a factor. What normal parsers do though is they determine the order of operations there, before ever knowing anything about the function invocation, constants, paren expressions, etc. They basically form the structure of the code ignoring any sort of conversions or trickery that needs to be done to turn that token into an int.

Since I'm doing the order of operations after knowing what trickery is necessary, the parser has forgotten if the int it sees is a constant, or the result of a function, or all of the limitations that formal grammars put on conversions to a factor. So Tangent happily converts 3 to a factor, then to a term to do the addition, and then circles the int result back around to a factor to multiply afterwards. To get this working, I'll need to break the cycle which will make normal use of integer parameters broken and/or weird.
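To see where 20 comes from, it helps to list the ways the expression can be grouped. The Python sketch below does not model Tangent's type inference at all; it just enumerates every full parenthesization of 2 + 3 * 4 (the helper name groupings is made up). Once an int result is allowed to circle back into a factor, both groupings are well typed, and the one that does the addition first is the one that yields 20.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}


def groupings(tokens):
    """Yield (text, value) for every way to fully parenthesize the tokens.

    tokens alternates operands and operators, e.g. [2, "+", 3, "*", 4].
    """
    if len(tokens) == 1:
        yield str(tokens[0]), tokens[0]
        return
    # Pick each operator in turn as the last one applied.
    for i in range(1, len(tokens), 2):
        for left_text, left in groupings(tokens[:i]):
            for right_text, right in groupings(tokens[i + 1:]):
                yield (f"({left_text} {tokens[i]} {right_text})",
                       OPS[tokens[i]](left, right))


for text, value in groupings([2, "+", 3, "*", 4]):
    print(text, "=", value)
```

A conventional grammar only ever admits the first grouping, (2 + (3 * 4)) = 14, because nothing in it turns the result of an addition back into a factor.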

So, 86. The root cause of the bug was in how the compiler generated its CIL. If you remember back to my post about loops, Tangent uses tail call optimization to keep function invocation workable in those sorts of scenarios. The issue was that I always added the tail call prefix to the last statement in a function. And I didn't change that when I added built-in opcodes like add and multiply (and constructor calls). So in the CIL, you saw arg1; arg2; tailcall; add;. .NET was happy to let that run, but I expect that the tailcall fubar'd my stack so that when the add actually ran, the arguments had been replaced by whatever random stuff was on the stack when that method started. Bad times.

Friday, May 8, 2015

Symbols in phrases

Quick update. One of the things that has been around in previous iterations of the language, but which I've neglected to implement so far, is the ability to use symbols when defining a phrase. That led to verbose operators in previous examples:

(x: int) plus (y: int) => int {...}
(a: bool) equals (b: bool) => bool {...}

With today's update, you can now use symbols (at the moment, pretty much any non-ascii, non-whitespace, non-open-curly/open-parens unicode character) in your phrase declarations:

(x: int) + (y: int) => int {...}
(a: bool) = (b: bool) => bool {...}

Thursday, May 7, 2015

Product Types

Sorry for the delay in posting, but I found myself stuck deciding where to go next with the language. On one hand, I want to get the thing built enough that I can build non-trivial programs with it, to see how useful/terrible the inference works in practice. On the other, I want to produce some more academic results showing that the order of operation inference is equivalent to existing methods for doing this sort of stuff. All while knowing that I was unlikely to achieve either. Let's just say that it was not the most motivational of situations.

I still haven't really decided on which path to take, but I'm moving forward anyways. Today's post is about Product Types. Wikipedia and other sources have nice formal descriptions if you're so interested. Practically, it means that I've started in on what most of you know as classes. Tangent so far has only allowed enums, which allowed types to be one of some well defined set of values. What I've added is the phrase syntax to type declarations. Consider this trivial test program:

int :> enum {a}
factor :> factor (x: int){}

foo(f: factor) => void { 
  print "in foo";
}

entrypoint => void {
  foo factor a;
}

Here, factor is a product type which takes only a single input, a (poorly named) int. That's a terrible example. Let's look at something better:

height :> (ft: int) feet (in: int) inches {}

This is modeled a bit off of Haskell. The left side of the declaration is the type name (and eventually, generic parameters). The right side of the declaration is the constructor. It will act akin to C# 6 Primary Constructors. The parameters declared in the constructor can be used within the class declaration (in curlies, empty here and currently unsupported). But since the constructor uses the same phrase syntax that functions do, you're free to make it more descriptive than new Foo(blah, blah).
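If primary constructors are unfamiliar, a Python dataclass gives a rough feel for the idea: the constructor parameters double as the fields available inside the type. This is only an approximation under assumptions. The class name Height, the renamed field inches (in is a Python keyword), and the total_inches helper are all invented for illustration, and Python has nothing like the phrase-style `5 feet 11 inches` constructor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Height:
    """Rough analogue of `height :> (ft: int) feet (in: int) inches {}`."""
    ft: int
    inches: int  # `in` is a Python keyword, so the field is renamed here

    def total_inches(self) -> int:
        """Hypothetical helper that uses the constructor parameters in the body."""
        return self.ft * 12 + self.inches


h = Height(ft=5, inches=11)
print(h.total_inches())
```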

The compiler will now take this code and generate a nice POCO, as well as call the constructor at the appropriate time. It only took about 4 hours too, even after the time away from the code, which is a nice sign that the underlying code is solid. You can't actually do anything with the things yet. I need to decide if I want to work with them via pattern matching or via a more OO style approach.

Sunday, March 8, 2015

Milestone 1 revisited

So it turns out that I am an idiot. Well, I was being an idiot. Now perhaps not so much. I talked in the Milestone 1 post about how looping will eventually overflow the stack because of how it's implemented. In short, I said that I can't use tail call optimization because the compiler can't tell if the recursive call is in tail position, since conditionals are implemented in-language.

That was entirely true. But what I failed to realize (until last night, idly thinking on the edge of sleep) is that it doesn't matter if the recursive call is in tail position. Let's look again at our while code:

while (condition: ~>bool) (body: ~>void) => void {
    if condition {
        body;
        while condition body;
    }
}

and think about how this gets compiled into CIL:

temp1 = condition
temp2 = new Closure({ body; while condition body; });
call if(temp1,temp2)

So while the recursive call is in tail position for the closure, the real eureka moment was realizing that the conditional was the thing that mattered conceptually. It was the function call in tail position for the loop. All of the internet articles reiterate how conditionals are special for determining tail call usage, and because I was so focused on the recursive aspect I did not stop to consider how doing things differently invalidated conventional wisdom.
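The effect can be imitated in Python, with one loud caveat: Python has no tail calls, so this sketch swaps in a trampoline (a driver loop that runs returned thunks) where the real compiler emits a CIL tail call. The names if_, while_, trampoline and bump are all hypothetical. What it is meant to show is that the call to the conditional is the last thing the loop does, and that the recursive call inside the block is the last thing the block does, so neither needs to keep its stack frame alive.

```python
def if_(condition, positive):
    """In-language style `if`: hand back the block instead of calling it."""
    return positive if condition else None


def while_(condition, body):
    """`while` written as recursion; the call to if_ is the one in tail position."""
    def block():
        body()
        return while_(condition, body)  # returns a thunk; nothing runs yet
    return lambda: if_(condition(), block)


def trampoline(thunk):
    """Stand-in for the tail call: run pending thunks at constant stack depth."""
    while callable(thunk):
        thunk = thunk()


counter = {"n": 0}


def bump():
    counter["n"] += 1


# Far more iterations than Python's default recursion limit would allow.
trampoline(while_(lambda: counter["n"] < 100_000, bump))
print(counter["n"])
```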

Anyways, loops now work as expected without stack overflow and without needing to build them into the language. The code is in github marked as Milestone 2.

Thursday, March 5, 2015

Changes to return and added debugging support

Two updates to talk about today.

The first is a change to how return works in the language. It now does not work. This is a side effect of making blocks into implicit closures. By doing that, it lets you pass them around (cool), but it makes it really awkward to return from them. I had intended to have return simply set a return value and exit, allowing the calling function to use the return value if it was expecting one. But even there, it becomes confusing if the return value is for the function or for the closure. In the end, I decided it wasn't really worth the confusion (for now).

So instead, Tangent will now do a little type inference with blocks (and functions). Similar to CIL, the return value for a function (or block/closure) will be the last statement in the function. So it works something like this:

f(x: ~> int) => void { }
g => string { 
    f {
       // ...

So the normal Tangent order of operation inference works, but the target of the algorithm differs. For normal blocks, the last statement will need to return void, just like any other statement (or any other block). Otherwise, it provides a mechanism for basic anonymous functions.

The second change was the addition of debugging info to the compiler. If you start debugging against the exe the compiler generates, you'll now be able to step through code, set breakpoints, inspect variables, etc. While it was a lot of work to wire the line/column info from source to compiler, there's not much to actually making the debugging work: one extra ILGenerator call to mark the reductions with their location in code. Visual Studio does the rest. I had done this with earlier revisions of the language, but it is a nice reminder of how powerful well-made software can be.

Friday, February 20, 2015

Milestone 1

Finally, some approximation of success. Milestone 1 has been marked on github and is free to download. The language has enough functionality to meet my initial goals, which were to allow in-language conditionals and in-language loops. It does the order of operations inference and actually runs. Unfortunately there is one small issue and one fairly significant issue. The small issue is that you can't return out of blocks currently. Since blocks are implemented as closures, I can't jump to the end of the function since I'm no longer in the function when the return is hit.

The bigger problem is that because blocks are implemented as closures, using a block in a while loop will (eventually) cause a stack overflow. There are a few ways around that, the most natural to this design being implementation of Tail Call Optimization. Except tail call optimization can only be done when I know the closure call is in tail position. And since conditionals are done in-language, the compiler can't tell. I asked a question on stack exchange about it and... did not get help really. The academics don't care about the implementation and the pragmatists don't care about the theory.

Oh, and the error reporting is horrific and there are no debugging symbols yet. Sorry.

Anyways, here is the test program for milestone 1 - basic booleans, if statements and while loops - all defined in-language with only enum declaration, function declaration, void/unit-type, specialization and closures built into the language itself:

bool :> enum {

if (condition: bool) (positive: ~>void) => void {

if (condition: bool) (positive: ~>void) else (negative: ~>void) => void {

if (condition: bool.true) (positive: ~>void) => void {

if (condition: bool.true) (positive: ~>void) else (negative: ~>void) => void {

(lhs: bool) equals (rhs: bool) => bool {
    return false;
}

(lhs: bool.true) equals (rhs: bool.true) => bool {
    return true;
}

(lhs: bool.false) equals (rhs: bool.false) => bool {
    return true;
}

while (condition: ~>bool) (body: ~>void) => void {
    if condition { body; while condition body; };
}

entrypoint => void {
    while true equals true print ".";
}

Which in turn compiles to CIL (yes, CIL lets you define functions with spaces in the name):

I'm not sure what the next steps will be quite yet. I expect it will be implementing ints with proper order of operations in-language. Depending on the approach, I may first move this stuff into built-ins to help improve performance and structure of the built code. I also might get distracted by the lack of general equality and push generics in to support that, or some fancy ill-conceived idea like usual.

But keeping things simple has worked fairly well so far (and cs.stackexchange less so). A good lesson to remember.