Syntax - the thorny issue of syntax

Why are tabs, spaces, braces and semi-colons so divisive?

Background

In my previous post, I made the argument for a lot more constructs. This post is about how all those constructs, immutability and flow control will be expressed in EK9.

It might seem strange that I've left syntax until this point in this series of posts. While the syntax is the first thing most developers see when they start to use a new language, the underlying DNA of the language exists in its own right.

Who is the code for?

The code is for the developer to provide functionality in the form of software for a customer. But there are a couple of other interested parties besides the developer and the end customer:

  • The compiler -- this transforms the code into some form of executable
  • Other developers -- unless your code is small/trivial, it will be read by other developers
  • Your future self -- you will spend quite a bit of time in the future reading code

You may think the idea of your future self is a contrivance. But when you develop code, you get deep into it, and lots of context, pre-conditions and logic seem obvious. With the passage of time all of that fades and you are left with only the code. So that code needs to be clear.

So what's needed?

The end customer just wants the functionality, but they also want the developer to be able to add new features in a timely manner. They absolutely want it to work and don't want defects.

The compiler needs clear, unambiguous code. It must parse that code so it can warn the developer of potential issues and produce the final executable for the end customer. So we need to provide the minimal readable syntax that enables the compiler to process the code.

The developers (including yourself) need:

  • To be able to organise the code in a logical manner to aid navigation
  • To be able to quickly read and comprehend the code
  • Some degree of symmetry and consistency in the syntax
  • To type the minimal amount of text (but it must still be clear)
  • To be able to make changes without introducing defects
  • To be able to express themselves using design patterns
  • To be able to avoid obvious pitfalls and mistakes

Note that the above list is just for the syntax. Developers also need the constructs that enable them to express themselves and create abstract concepts; this has been covered in a previous post. Strong typing has also been covered and is a key part of EK9.

Developer needs

Focusing on the needs of the developer from the section above means that we need to understand how developers, and more generally how humans, comprehend written text. Only then can we get an idea of what a syntactic layout might look like. The aim is to enable the developer to write idiomatic EK9.

Directory Structures

Developers will come up with and evolve file system directory structures when organising code; these enable software to be broken down into related concepts or sub-systems. Indeed the tree-like structure (or directed graph) is quite a natural organisational mechanism. So EK9 will facilitate the full and free use of directory structures for files that contain EK9 code.

File Names

For source code files, and files in general, most developers and users of software expect a file name to end in a suffix. This suffix tells them what the content is likely to be; for example .txt, .png, .xls. So EK9 source code files will have a suffix of .ek9.

Source File Layout

Most people are used to reading books with some form of internal structure; such as:

  • Foreword
  • Introduction
  • Chapters (N)
  • Conclusion

Languages like C, C++, Java, C#, etc have a minimal layout, but this is mainly due to their limited range of constructs. This has one major disadvantage: you see the words fun/function and class/interface repeated frequently.

Pascal and some other languages have a sectional layout. This approach means that if you define a functions block or a procedures block, everything in those blocks is naturally a function or a procedure.

As EK9 has quite a few constructs it will follow the Pascal model. This reduces the amount of repetitive text that a developer needs to type and provides a logical and rational structure within the EK9 source code file.

Here is the broad idea of that layout:

#!ek9
defines module introduction
  references
    //Multiple references to constructs
  defines constant
    //Multiple constants
  defines type
    //Multiple types
  defines function
    //Multiple functions
  defines trait
    //Multiple traits
  defines class
    //Multiple classes
  defines record
    //Multiple records
  defines text
    //Multiple text for languages
  defines component
    //Multiple components
  defines application
    //Multiple applications
  defines service
    //Multiple services
  defines program
    //Multiple programs
  defines package
    //A package definition

So this approach is driven by the logical structure that most humans are used to reading (books) and the desire to reduce typing (where possible, but retaining clarity). The only real downside is that the developer needs to know the context, that is, which section a definition sits in. This is not too bad with a small number of construct definitions, and as EK9 enables multiple files to be part of the same module, files can be numerous, short and related.

This gives the developer flexibility on how to group files in directory structures and how best to group specific constructs within the sections in those files. For example, the developer could elect to put all the functions in one file and all the classes in another, or they could mix and combine constructs.

Block Layout

Vertical layouts can be read with ease:

  • Column based layouts (like newspapers) -- but only to a finite degree
  • Tree/Graph layouts -- but again only to a finite degree
  • Indented code block layouts -- but only to a finite depth

The key point in the bullets above is restraint. Deep indentation, excessive columns or dense tree structures will not work; moderation is essential. EK9 accomplishes this in a variety of different ways; not through hard compiler driven rules, but by creating pain and therefore encouraging decomposition. This further aids reuse and testing. This pain creation is deliberate. It sounds perverse to design something that limits and frustrates, but we developers need to be encouraged in the right direction.

A concrete example of this is the omission of direct inline lambda functions. EK9 offers Dynamic Functions instead. These separate the definition of small dynamic functions from their use. This is a deliberate pain point aimed at encouraging structure and facilitating reuse and testing.

Visual Flow

We can also understand horizontal flow, such as:

  • -> -- flow into something
  • <- -- flow out of something

These ideas will drive the layout and syntax of EK9. This layout is quite different to many other programming languages (though there are some similarities to Python in places).

Punctuation

OK, now it's time to bite the bullet.

I don't like too much punctuation in source code. I always ask myself the following questions:

  • Why does another developer need a colon ':' or a semi-colon ';'?
  • If I've indented this block of code -- why do I need the '{' '}'?
  • How can I encourage more or less use of capabilities in the appropriate context?

Clearly we've had a good example of this with Python/YAML. But it's not all plain sailing with white space and the amount/type of indentations. It is possible to introduce defects that are not easy to spot.
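As a hedged illustration of the kind of whitespace defect meant here, the Python sketch below shows two functions that differ only in the indentation of a single line. The function names and data are hypothetical and are not taken from EK9 or from any real code base.

```python
def total_with_bug(prices):
    """Intended to sum every price, but the return is indented one level too far."""
    total = 0
    for price in prices:
        total += price
        return total  # inside the loop: returns after the first item


def total_fixed(prices):
    """The same code with the return dedented so the loop runs to completion."""
    total = 0
    for price in prices:
        total += price
    return total  # after the loop: returns the full sum


if __name__ == "__main__":
    print(total_with_bug([1, 2, 3]))  # prints 1
    print(total_fixed([1, 2, 3]))     # prints 6
```

Both versions are valid Python, so nothing flags the first one; only the result is wrong.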

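But the same can be said of languages that employ semi-colons. Here's an example in C, where the stray semi-colon after the closing parenthesis of the for statement becomes the (empty) loop body, so "Hello" is printed once rather than ten times: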

for(int i=0; i<10; i++);
  printf("Hello");

The use of '<' '>' with generics/templates, in C++ and Java for example, is another area that can lead to excessive punctuation. Why not just use:

//List example
List of String
//Dictionary example
Dict of (Integer, String)

There is a place for punctuation characters as shown here for lists.

...      
  aList <- [ 1, 2, 3, 4, 5, 6, 7, Integer() ]
...

and here for dictionaries/maps:

...
  aDictionary <- { 1: "Gomez", 2: "Morticia" }
...

The list example and dictionary example give more details. The use of punctuation in definitions like these feels more natural. When this punctuation is combined with type inference, as shown above, the result is a terse statement that is clear to read and requires minimal typing.

But the streaming pipeline and operators use punctuation extensively. Part of the Streaming Pipeline example is shown below (some functions omitted for brevity):

...
  defines function

    filterBooksToOutput()
      ->
        books as List of Book
        filterSelection as BookFilterSelection
        output as StringOutput

      authorId <- getAuthorChange()  
      theBookFilter <- suitableBookFilter(filterSelection)

      cat books
        | sort by compareAuthor
        | split by authorId
        | select with sufficientBooks
        | map by orderOnPublishedDate
        | map with theBookFilter
        | flatten
        > output
...

This should give you some idea of the look and feel of EK9 source code.

Indeed there are additional operators in EK9 that don't appear in most languages. EK9 has all the normal comparison operators. It also provides fuzzy comparison, null coalescing, Elvis operator and some short cuts that combine null coalescing and comparison operators.

Here are some examples:

bird1 as String
bird2 <- "Duck"
birdA <- bird1 ?? bird2
birdA := bird1 <? bird2
birdA := bird1 <=? bird2
birdA := bird1 >? bird2
birdA := bird1 >=? bird2
//In all above examples birdA would have the value of "Duck"
//because 'bird1' is not set 
birdB <- bird1 ?: bird2
//birdB would also have the value of "Duck"
//because 'bird1' is not set

bird1 := String()
birdA := bird1 ?? bird2
//birdA would now be unset and not have a meaningful value
//because it has taken the same value as 'bird1'
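For readers more familiar with other languages, here is a rough Python model of the behaviour described in the comments above. It is a sketch under stated assumptions, not EK9's implementation. The Value class and the three function names are hypothetical. It assumes that '??' only checks whether an object exists at all, that '?:' also checks whether the object is set, and that '<?' picks the smaller of two set values or otherwise whichever one is set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Value:
    """Hypothetical stand-in for an EK9 value: allocated, but possibly not set."""
    content: Optional[str] = None

    @property
    def is_set(self) -> bool:
        return self.content is not None


def null_coalesce(left, right):
    """Models '??' (assumption): use the right side only when there is no left object."""
    return right if left is None else left


def elvis(left, right):
    """Models '?:' (assumption): use the right side when the left is missing or unset."""
    return left if left is not None and left.is_set else right


def coalescing_less_than(left, right):
    """Models '<?' (assumption): smaller of two set values, else whichever is set."""
    if left is None or not left.is_set:
        return right
    if right is None or not right.is_set:
        return left
    return left if left.content < right.content else right


if __name__ == "__main__":
    bird1 = None                    # like 'bird1 as String' with nothing assigned
    bird2 = Value("Duck")
    print(null_coalesce(bird1, bird2).content)         # Duck
    print(elvis(bird1, bird2).content)                 # Duck
    print(coalescing_less_than(bird1, bird2).content)  # Duck

    bird1 = Value()                 # like 'bird1 := String()': allocated but unset
    print(null_coalesce(bird1, bird2).is_set)          # False
    print(elvis(bird1, bird2).content)                 # Duck (under the assumption above)
```

The last two lines show why both operators can be useful: one keeps an unset value, the other replaces it.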

But the main point is that in EK9 the punctuation characters are used for assignments, lists, dictionaries and operators in general, and not for code block definitions.

The range of new operators, ternary expressions and type inference will probably take any developer a while to get used to. The aim of these operators is to reduce the amount of typing required and to make the code terse but readable when revisiting.

I've used '<' and '>' punctuation in combination with other characters to facilitate the use of block comments. Again this is by design to encourage some forms of commenting -- but not an excessive amount.

Comments

Commenting code is always a point of contention for developers. Some say comment everything; others say comment the why not the how. There is no simple answer that will keep everyone happy.

I'd suggest comment the why; but also comment the how, if the how does not look obvious.

EK9 offers several different types of comments.

//A Simple single line comment
name <- "Steve" //Can comment end of line as well
<!-
Or a block comment
Over multiple lines
-!>

defines function
  <?-
  This is an explicit documentation comment.
  -?>
  function1()
    -> n Integer
    <- sum Integer: 0
    ...

EK9 does not offer /* type block comments */ nor Java style /** Doc comments */.

Parameter declarations

Declaring functions and methods that accept incoming parameters is significant in terms of design. The developer is defining an interface that will be used. By combining the ideas from the block layout above, EK9 provides quite a unique and clear way to declare incoming and returning parameters.

#!ek9
defines module introduction
  defines function

    function1()
      -> n Integer
      <- sum Integer: 0      
      ...

    function2()
      ->
        n1 Integer
        n2 Integer
      <-
        sum Integer: 0
      ...
//EOF

So data being passed into the function is demarked by using -> and data being returned is demarked by using <-. For multiple parameters there is also an indentation block.

This mechanism creates tree/graph and column structures, but also indicates flow. It encourages single parameters; in general, fewer parameters make for a cleaner design.

Source code text is not densely packed and white space is created. UI designers and web developers spend a significant amount of time designing white space, and they do this with good reason. EK9 uses this same approach but in software source code.

This does make the source code files longer and much less dense. This is by design; it encourages the developer to break source files up.

It is easy to glance at a block of code and just see the flow in and out with parameter names/types. It also limits the length of each line. Ideally source code line length should be 80-120 characters. This is not driven by terminal width, font size nor any other technology capability; it is purely to do with how humans comprehend text.

It is not a hard and fast rule, but the EK9 syntax structure encourages a general move to shorter lines in a more column based format. Here is a short example (streaming pipelines):

...
cat books
  | sort by compareAuthor
  | split by authorId
  | select with sufficientBooks
  | map by orderOnPublishedDate
  | flatten
  > stdout
...

This could have all been done on a single line, but it is just much easier to see at a glance when in a column format as above.

End of Line

EK9 does not use the semi-colon ';' at all. It is not even used as an end of line marker.

Variable Declarations and Assignments

Variable declarations and variable assignments have a number of forms. These are designed to provide the developer with a number of natural and simple syntactic symbols that are familiar from other languages. But importantly they provide a mechanism in EK9 for type inference and also provide a very terse way to work with variables.

Here are some examples:

age as Integer: Integer()
age as Integer = Integer()
age as Integer := Integer()
age as Integer := 21
age <- 21

The first three lines declare a variable as an Integer, but with an unset value. The fourth declares it and assigns the value 21. The final example shows the most terse mechanism, which uses type inference. This type inference can be used with any type including your own type definitions, classes, traits, components and functions. For more details see type inference.

Initially I didn't like this type inference syntax, but over time it grew on me more and more (when I say time I'm talking years here). Now I look at it and it seems so concise, terse and elegant; it's really about the shortest amount of code you could write.

It has an interesting side effect (which was not planned); it encourages the declaration of new variables in preference to re(mis)-using existing variables. When designing pure code that is immutable, this, in conjunction with ternary operators, enables very succinct code to be written.

Flow Control

The if/else and switch syntax is pretty much what you'd expect; nothing too radical. There are, however, alternative syntaxes available, as well as guarded assignments, and switches can return values. So there are some innovations, but several other languages already have these.

Here is a short example of an if statement:

...
  defines program
    simpleIf()
      stdout <- Stdout()
      valueToTest <- 9

      if valueToTest < 10
        stdout.println(`Value ${valueToTest} is less than 10`)

      //Rather than use the keyword 'if' you can use 'when'
      //Also you could just use string concatenation
      when valueToTest < 10
        stdout.println("Value " + $valueToTest + " is less than 10")
...

The for loop takes a few ideas from other languages, but is less like C, C++, C#, Java or JavaScript. It is more like Pascal and Ada.

Importantly the loop variable's type is inferred and the value of the loop variable is in effect immutable as far as the developer is concerned. This means the developer cannot alter the loop variable.

Here is an example (it uses time of day and durations in the loop):

timeForLoop()
  stdout <- Stdout()
  startOfWorkDay <- Time().startOfDay() + PT9H
  endOfWorkDay <- Time().endOfDay() - PT6H30M
  thirtyMinutes <- PT30M

  for theTime in startOfWorkDay ... endOfWorkDay by thirtyMinutes
    stdout.println(`Value ${theTime}`)

It can also take this form with collections and iterators:

for item in ["Alpha", "Beta", "Charlie"]
  stdout.println(item)

And this form with Stream pipelines:

defines function
  workHours()
    -> time as Time
    <- workTime as Boolean: time < 12:00 or time > 14:00

  timePipeLine()
    stdout <- Stdout()

    start <- Time().startOfDay() + PT9H
    end <- Time().endOfDay() - PT6H30M

    for i in start ... end by PT30M | filter by workHours > stdout
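To make the data flow of that one-line pipeline concrete, here is a rough Python equivalent. It is only a sketch: the helper names (time_range, work_hours, time_pipeline) and the ANCHOR date are hypothetical, the range is assumed to include its end point, and the end time of 17:29:59 assumes that endOfDay() means 23:59:59.

```python
from datetime import date, datetime, time, timedelta

ANCHOR = date(2000, 1, 1)  # arbitrary date, only needed to do arithmetic on times


def time_range(start, end, step):
    """Yield times from start up to and including end, advancing by step."""
    current = datetime.combine(ANCHOR, start)
    last = datetime.combine(ANCHOR, end)
    while current <= last:
        yield current.time()
        current += step


def work_hours(t):
    """Mirror of workHours above: true before 12:00 or after 14:00."""
    return t < time(12, 0) or t > time(14, 0)


def time_pipeline(start, end, step):
    """Generate the range, then keep only the values that pass the filter."""
    return [t for t in time_range(start, end, step) if work_hours(t)]


if __name__ == "__main__":
    # 09:00 to 17:29:59 in thirty minute steps (end time is an assumption)
    for t in time_pipeline(time(9, 0), time(17, 29, 59), timedelta(minutes=30)):
        print(t)
```

The Python version needs a helper for the range and a list comprehension for the filter, whereas the EK9 form reads left to right as generate, filter, output.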

Hopefully you can now start to see how the EK9 syntax has been formed and how readable/expressive it can be. With the introduction of streaming pipelines, EK9 promotes a more functional approach. This further encourages steps in processing that can be expressed in a column layout when the steps are numerous; this aids readability.

Summary

Superficially EK9 looks a little like Python. Whilst the block notation does use indentation, there are few other similarities.

I challenged the need for every character whilst developing the EK9 syntax. I followed the layout design from Pascal, using a sectional structure to reduce the repetition of fun/function and class text.

The introduction of a range of new null safe (is set) operators that act more like ternary operators enables a reduction of if/else statements.

EK9 is strongly typed and compiled. The EK9 syntax has been created and developed with:

  • Vertical structures -- like paragraphs and sections
  • Tree structures -- like directories
  • Horizontal structures -- like newspaper columns
  • Horizontal flow -- across the page
  • Minimal text -- that still permits easy comprehension
  • Concepts and ideas from a range of other languages
  • Limited punctuation for code block structures
  • Additional operators to reduce code further

The final syntax is not one I was expecting, as my background is in C, C++ and Java. In some ways the EK9 syntax even feels alien to me!

The definition of the EK9 language is in the form of an ANTLR4 grammar.

Braces and Indentation

I've not really answered the question: 'Why are tabs, spaces, braces and semi-colons so divisive?'. I don't really have an answer I'm afraid. Having used a range of languages including C and Python, I'm surprised by the strong reaction people have to a few characters. Though I have to confess to an aversion to punctuation in block structures myself.

I've been driven by how I best understand code and its layout. I've tried to find some patterns to see if I could come up with a direction, and tried to be consistent.

Next Steps

The next series of posts will cover the building of the compiler. Importantly this compiler will be driven through the development of a Language Server for VSCode. The VSCode extension has already been developed and is available on GitHub.

As the grammar for EK9 has been defined in ANTLR4, much of the lexing/parsing hard work has been done for me (I'm not going to cover lexing and parsing as there is plenty of material available on that already).

Special thanks to Terence Parr for providing ANTLR; it has made a big difference when experimenting with grammars and structures.

So much of the next post will cover the development of the Symbol Table.