# Phases and Rules

Now that the EK9 grammar has stabilised, it's time to ensure that the EK9 code written by the developer actually makes sense. The compiler does this in a number of phases. All of these phases use ANTLR Listeners/Visitors, but they delegate processing, rules and checks to *functions*.

These functions tend to be composed of even more *functions*. Each one is designed to be focused on one task (or a limited number of related tasks). This is in contrast to the first prototype, which used a more Object-Oriented solution and quickly became very complex. The *function*-based approach is much simpler and enables much more reuse through composition.
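
To give a flavour of this style, here is a minimal Java sketch of small, single-purpose checks composed into a larger rule. All of the names (`RuleComposition`, `NOT_EMPTY`, `STARTS_WITH_LETTER`, `VALID_IDENTIFIER`) are invented for illustration and are not taken from the EK9 compiler source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Illustrative only: hypothetical checks, not the EK9 compiler's own functions.
final class RuleComposition {
  // Each check receives the text to examine and a list in which to record errors.
  static final BiConsumer<String, List<String>> NOT_EMPTY =
      (name, errors) -> {
        if (name.isEmpty()) {
          errors.add("name is empty");
        }
      };

  static final BiConsumer<String, List<String>> STARTS_WITH_LETTER =
      (name, errors) -> {
        if (!name.isEmpty() && !Character.isLetter(name.charAt(0))) {
          errors.add("'" + name + "' must start with a letter");
        }
      };

  // Composition: the combined rule simply runs each focused check in turn.
  static final BiConsumer<String, List<String>> VALID_IDENTIFIER =
      NOT_EMPTY.andThen(STARTS_WITH_LETTER);

  static List<String> check(String name) {
    List<String> errors = new ArrayList<>();
    VALID_IDENTIFIER.accept(name, errors);
    return errors;
  }

  public static void main(String[] args) {
    System.out.println(check("9lives"));
  }
}
```

Each check can be tested on its own and reused in other combined rules, which is the kind of reuse through composition described above.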

### Phases/Passes

With the EK9 grammar and language structure it is not possible to validate everything in a single 'pass'. This is because:

* EK9 supports forward use (i.e. reference before declaration)
    
* EK9 supports type inference
    
* EK9 does not include the concept of *header files*
    
* Some rules can be checked very early, while others cannot:
    
    * Early checks can emit errors very quickly, speeding up the development cycle
        
    * Other rules need more details and so checks can only occur in later phases
        

This means that EK9 source code can reference other **functions**, **classes**, etc. from other EK9 source files, or even from the same source file when they are declared after their use. Moreover, EK9 has a form of type inference within code blocks, meaning that the compiler can work out the **type** of a variable being used in many cases.

To accomplish this, the EK9 compiler *visits* the AST a number of times (in phases). Importantly, some of these phases can process code structures concurrently on a per file basis. The EK9 compiler has been designed to use multi-core CPUs from the outset.
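
The need for more than one pass can be shown with a toy example. The sketch below assumes an invented mini language in which each line is either `def name` or `use name`; the class and method names are hypothetical. The first pass records every definition, so the second pass can accept a use that appears before its declaration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a toy language, not EK9, to show why forward use needs two passes.
final class TwoPassResolver {
  // Returns the names that are used but never defined anywhere in the source.
  static List<String> unresolved(List<String> lines) {
    Set<String> defined = new HashSet<>();
    // Pass 1: record every definition, wherever it appears.
    for (String line : lines) {
      if (line.startsWith("def ")) {
        defined.add(line.substring(4));
      }
    }
    // Pass 2: every use can now be checked, even if it is declared later.
    List<String> missing = new ArrayList<>();
    for (String line : lines) {
      if (line.startsWith("use ") && !defined.contains(line.substring(4))) {
        missing.add(line.substring(4));
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    System.out.println(unresolved(List.of("use b", "def b", "use c")));
  }
}
```

A single pass over the same input would wrongly report `b` as unresolved, because its definition has not been seen when the first line is processed.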

### Phase Zero

This phase accepts the source files that are to be compiled and parses them concurrently. The parsing of source files is quite an *expensive* operation. This phase leverages **ANTLR** (and its Java runtime) to parse and validate the overall syntactic structure of the EK9 source code. The errors at this level are generated by **ANTLR**. This phase is only focussed on **syntax** and not **semantics** (i.e. the meaning of the code).

The main costs in parsing are:

* Reading data from slow storage (yes, even SSDs are slow compared with memory)
    
* Breaking that source data text into Lexemes
    
* Building an Abstract Syntax Tree (AST) via the grammar
    

Given that most applications consist of hundreds if not thousands of source files (and that EK9 always works from source), it is essential that the above process is carried out quickly.

With the widespread availability of multi-core CPUs, and with core counts and memory sizes increasing, concurrent file processing is viable.
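
As a rough illustration of per-file concurrency, the Java sketch below processes several sources in parallel. It is only a sketch under stated assumptions: `tokenise` is a trivial stand-in for the real ANTLR lexer and parser, the sources are passed in as strings rather than read from disk, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: one unit of work per source, spread over the available cores.
final class ConcurrentParse {
  // Stand-in for the real lexer/parser: splits text into whitespace-separated lexemes.
  static List<String> tokenise(String text) {
    String trimmed = text.trim();
    return trimmed.isEmpty() ? List.of() : Arrays.asList(trimmed.split("\\s+"));
  }

  // Key is a file name, value is that file's contents.
  static Map<String, List<String>> parseAll(Map<String, String> sources) {
    // A concurrent map is needed because several threads record results at once.
    Map<String, List<String>> parsed = new ConcurrentHashMap<>();
    sources.entrySet().parallelStream()
        .forEach(entry -> parsed.put(entry.getKey(), tokenise(entry.getValue())));
    return parsed;
  }

  public static void main(String[] args) {
    System.out.println(parseAll(Map.of("a.ek9", "defines module a")));
  }
}
```

Because each source is handled independently, no coordination between threads is needed here beyond the thread-safe result map.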

However, this also means that the EK9 compiler will use a significant amount of memory. It is designed from the outset to only use memory; no file paging is employed at all. Big programs will require big memory and lots of CPUs. But really, modern development should at least try not to build huge monolithic applications where possible. They should be designed to be modular.

It is during this phase that an appropriate [error listener](https://repo.ek9lang.org/apidocs/0.0.1-SNAPSHOT/org/ek9lang/compiler/common/ErrorListener.html) is configured and associated with each source file. Subsequent phases will emit errors in compilation using this [error listener](https://repo.ek9lang.org/apidocs/0.0.1-SNAPSHOT/org/ek9lang/compiler/common/ErrorListener.html) object.

[See code documentation](https://repo.ek9lang.org/apidocs/0.0.1-SNAPSHOT/org/ek9lang/compiler/phase0/package-summary.html) for details of phase zero.

Phases 1-5 apply layered and progressive **semantic** checks, over and above defining and enriching the internal **Symbols**.

### Phase One

Once each EK9 source file has been loaded and parsed it can be 'visited'. This means the AST can be traversed and **Symbols** defined by the EK9 developer can be identified and recorded. This is all done from data now held in memory.

This is in effect the first real 'pass' through the EK9 code, and it is run in a concurrent manner. However, this is where things start to get complex.

Where multiple EK9 source files form part of the same **module** the **Symbols** must all be recorded in that same **module**. Moreover, they must not *clash* with each other.

To accomplish this, the EK9 compiler protects **modules** with concurrency locks. This is designed to prevent multiple processing threads altering the internal state at the same time.
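
A minimal sketch of the idea in Java follows, assuming a hypothetical `LockedModule` class that maps symbol names to the source file that defined them. The real compiler's module and symbol structures are richer than this; the point is only that the clash check and the recording happen as one step under the lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: a module scope shared by several source files.
final class LockedModule {
  private final ReentrantLock lock = new ReentrantLock();
  private final Map<String, String> symbolToSource = new HashMap<>();

  // Returns true if recorded, false if the name clashes with an existing symbol.
  boolean define(String symbolName, String sourceFile) {
    lock.lock();
    try {
      // Check-and-record must not be interleaved with another thread, hence the lock.
      if (symbolToSource.containsKey(symbolName)) {
        return false;
      }
      symbolToSource.put(symbolName, sourceFile);
      return true;
    } finally {
      lock.unlock();
    }
  }

  // Returns the source file that defined the symbol, or null if it is unknown.
  String definedIn(String symbolName) {
    lock.lock();
    try {
      return symbolToSource.get(symbolName);
    } finally {
      lock.unlock();
    }
  }

  public static void main(String[] args) {
    LockedModule module = new LockedModule();
    System.out.println(module.define("Customer", "a.ek9"));
    System.out.println(module.define("Customer", "b.ek9"));
  }
}
```

Whichever thread gets the lock first records the symbol; the second attempt is rejected and would lead to a duplicate symbol error being emitted.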

There are actually multiple forms of **Symbol** recording:

* Against an AST *tree node*
    
* Within a *module*
    
* Within some form of *aggregate/function* in a *module*
    

It is also during phase one that **types** are identified (where possible) on variables and properties. Typically this is only possible where **literals** are used. In general, this is the approach the EK9 compiler takes: it builds up a more detailed picture of the **types** as each of the phases is processed.

See [phase one details](https://repo.ek9lang.org/apidocs/0.0.1-SNAPSHOT/org/ek9lang/compiler/phase1/package-summary.html) for some of the many functions that get called during this phase. These are all just basic checks over and above the ANTLR grammar.

In general the ANTLR grammar has been made more flexible and simpler; this has enabled more rules and better error messages to be built into the EK9 compiler itself.

### Phase Two

During Phase One all of the main *aggregates* and *functions* will have been defined in a very basic skeleton form. This phase starts to add more detail around those constructs, mainly focusing on **type** information.

The main purpose of this phase is to identify and resolve:

* Explicit **type** use
    
* Explicit **generic type** use
    
* Simple ‘constructor based’ inferred **type use**, including construction of parameterised generic types
    

The processing in this phase also does a number of additional basic checks (**semantic checks**) now that **Symbols** and their relationships have at least been outlined.

This includes ensuring that a **type** being extended is open for extension and that both types are of the same 'Genus'. For example, it is neither logical nor allowed for a 'Class' to extend a 'Record'.
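
As a sketch of what such a check might look like, the Java below (Java 17 or later assumed) models a type symbol with a genus and an 'open' flag. The `Genus` values are a small assumed subset and the names are hypothetical, not the compiler's own.

```java
import java.util.Optional;

// Illustrative only: a much reduced model of a type symbol and the 'extends' check.
final class GenusCheck {
  enum Genus { CLASS, RECORD, FUNCTION }

  record TypeSymbol(String name, Genus genus, boolean openForExtension) {}

  // Returns an error message, or empty when the extension is acceptable.
  static Optional<String> checkExtends(TypeSymbol subType, TypeSymbol superType) {
    if (!superType.openForExtension()) {
      return Optional.of(superType.name() + " is not open for extension");
    }
    if (subType.genus() != superType.genus()) {
      return Optional.of(subType.name() + " (" + subType.genus() + ") cannot extend "
          + superType.name() + " (" + superType.genus() + ")");
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    TypeSymbol point = new TypeSymbol("Point", Genus.RECORD, true);
    TypeSymbol circle = new TypeSymbol("Circle", Genus.CLASS, false);
    System.out.println(checkExtends(circle, point));
  }
}
```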

It is also during this phase that *functions* are examined to see if they fall into one of the following categories:

* Consumer
    
* BiConsumer
    
* Acceptor
    
* BiAcceptor
    
* Supplier
    
* Provider
    
* Function
    
* BiFunction
    
* UnaryOperator
    
* Predicate
    
* BiPredicate
    
* Assessor
    
* BiAssessor
    

As EK9 treats *functions* in a polymorphic manner, it automatically makes the above common patterns the 'super types' of those *functions* whose arguments, return values and 'pure' nature match. The generic functions above are really just common patterns of function signatures.

If a function does ‘match’ the signature of one of the above generic functions, then its ‘super function’ (a bit like a ‘super class’ for aggregates) is set. It should also be noted that each of the *generic functions* listed above has a ‘super function’ of **Any**.

While it may seem strange to give ‘functions’ a sort of hierarchy - it enables ‘functions’ as well as ‘aggregates’ to be treated in a polymorphic manner (sub-typing).
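
The matching idea can be sketched in Java (Java 17 or later assumed). This is a heavy simplification: it covers only four of the patterns, ignores the 'pure' nature entirely, and assumes each pattern has the same shape as its `java.util.function` namesake, which may not be exactly how EK9 defines them. The class, record and method names are hypothetical.

```java
import java.util.List;
import java.util.Optional;

// Illustrative only: match a function signature against a few common patterns.
final class SuperFunctionMatch {
  // returnType is null when the function returns nothing.
  record Signature(List<String> argTypes, String returnType) {}

  // Returns the matching pattern, or empty when none of the four applies.
  static Optional<String> superFunctionOf(Signature signature) {
    int argCount = signature.argTypes().size();
    String returns = signature.returnType();
    if (argCount == 0 && returns != null) {
      return Optional.of("Supplier of " + returns);
    }
    if (argCount != 1) {
      return Optional.empty();
    }
    String arg = signature.argTypes().get(0);
    if (returns == null) {
      return Optional.of("Consumer of " + arg);
    }
    if (returns.equals("Boolean")) {
      return Optional.of("Predicate of " + arg);
    }
    return Optional.of("Function of (" + arg + ", " + returns + ")");
  }

  public static void main(String[] args) {
    System.out.println(superFunctionOf(new Signature(List.of("String"), "Boolean")));
  }
}
```

Once a pattern is found, a function with that signature could be passed wherever the pattern is expected, which is the sub-typing described above.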

See [phase two details](https://repo.ek9lang.org/apidocs/0.0.1-SNAPSHOT/org/ek9lang/compiler/phase2/package-summary.html) for the full range of operations that are carried out in this phase.

### Phase Three

This is a key phase as it checks for **resolution** of all **symbols**. It also deals with processing and deducing **inferred types** as part of code block expressions.

There are many rules, checks (more **semantic** assertions) and processes in this phase. Indeed, when this phase is triggered and the AST is 'visited', each tree node is processed on both *entry* and *exit*.

The [Listener](https://repo.ek9lang.org/apidocs/0.0.1-SNAPSHOT/org/ek9lang/compiler/phase3/ResolveDefineInferredTypeListener.html) is the hook in from the ANTLR infrastructure, but really does very little other than calling one or more of the *functions* that perform the key processing of the specific EK9 language construct.

This approach has been taken to ensure there is a clear separation of concerns and a single responsibility for each of the aspects of processing.
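
The shape of a 'thin listener that only delegates' can be sketched in Java (Java 17 or later assumed). This uses a hand-rolled tree and walker rather than the ANTLR API, and every name in it is hypothetical; it only shows the listener holding no logic of its own and calling functions on entry and exit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative only: a hand-rolled tree and listener, not the ANTLR infrastructure.
final class DelegatingListener {
  record Node(String kind, List<Node> children) {}

  private final Consumer<Node> onEnter;
  private final Consumer<Node> onExit;

  DelegatingListener(Consumer<Node> onEnter, Consumer<Node> onExit) {
    this.onEnter = onEnter;
    this.onExit = onExit;
  }

  // Depth-first walk: each node is handed to a function on entry and on exit.
  void walk(Node node) {
    onEnter.accept(node);
    node.children().forEach(this::walk);
    onExit.accept(node);
  }

  // Records the order in which the entry and exit functions are called.
  static List<String> trace(Node root) {
    List<String> events = new ArrayList<>();
    new DelegatingListener(
        node -> events.add("enter " + node.kind()),
        node -> events.add("exit " + node.kind())).walk(root);
    return events;
  }

  public static void main(String[] args) {
    System.out.println(trace(new Node("function", List.of(new Node("block", List.of())))));
  }
}
```

Work that depends on child nodes having been processed (such as deducing the type of an expression from its parts) naturally belongs in the exit function.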

Any [common](https://repo.ek9lang.org/apidocs/0.0.1-SNAPSHOT/org/ek9lang/compiler/common/package-summary.html) or [support](https://repo.ek9lang.org/apidocs/0.0.1-SNAPSHOT/org/ek9lang/compiler/support/package-summary.html) *functions* are pulled out and made reusable in different phases.

During [phase three](https://repo.ek9lang.org/apidocs/0.0.1-SNAPSHOT/org/ek9lang/compiler/phase3/package-summary.html) many functions are called, and most either populate/augment the **Symbols** identified or process rules and emit compilation errors.

### Phase Four

This phase checks generic types where a real type (or indeed a conceptual type, in the case of generics using generics) has been employed as a type argument. When parameterising a generic type with one or more types, those types must support all the operators used within the generic type. This applies where generic/template types have been defined and operators have been ‘assumed’.

This approach enables generic types to be created with the assumption that, when parameterised, those types will have the right operators. Clearly the EK9 developer may attempt to parameterise the generic type with a type that does not have an essential operator; in this case a compiler error will be emitted. The EK9 developer can then add the operator to their type.
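
At its simplest, this check is a set difference between the operators the generic type assumes and the operators the type argument actually provides. The Java sketch below shows that under the assumption that operators can be compared by name alone (a real check would also need to consider their signatures); the names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: operators are compared by name, ignoring their signatures.
final class OperatorCheck {
  // Operators the generic type 'assumes', minus those the type argument provides.
  static Set<String> missingOperators(Set<String> assumed, Set<String> provided) {
    Set<String> missing = new TreeSet<>(assumed);
    missing.removeAll(provided);
    return missing;
  }

  public static void main(String[] args) {
    // A generic type using '<' and '==' parameterised with a type that only has '=='.
    System.out.println(missingOperators(Set.of("<", "=="), Set.of("==")));
  }
}
```

A non-empty result corresponds to the compiler error described above, and tells the developer which operators to add to their type.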

### Phase Five

Now all symbols have been identified and all references resolved. This phase is the PRE\_IR\_CHECKS phase. It is designed to be the last set of **semantic** checks that can be made with just the ANTLR AST and the symbols identified.

This is done because the generation of the ‘Intermediate Representation’ is quite costly (in time and memory), so any obvious issues that can be identified now can cause the compilation to fail as early as possible.

The typical checks in this phase are:

* Variables being used before being initialised
    
* Return values not always being initialised
    
* Safe access on Optional/Result methods such as get(), ok() and error()
    
* Guard expressions and uninitialised return values (if/for/try/while/switch, etc)
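
The 'return value not always initialised' check can be sketched as a small flow analysis in Java (Java 17 or later assumed). The statement model here is invented and far smaller than EK9's: only assignments and `if`/`else` are modelled, and the names are hypothetical.

```java
import java.util.List;

// Illustrative only: a tiny statement model with just assignment and if/else.
final class ReturnInitCheck {
  interface Stmt {}

  record Assign(String variable) implements Stmt {}

  // elseBranch is an empty list when there is no 'else'.
  record If(List<Stmt> thenBranch, List<Stmt> elseBranch) implements Stmt {}

  // True only if every path through the block assigns the variable.
  static boolean alwaysInitialised(String variable, List<Stmt> block) {
    for (Stmt statement : block) {
      if (statement instanceof Assign assign && assign.variable().equals(variable)) {
        return true;
      }
      // An 'if' only guarantees initialisation when both branches do.
      if (statement instanceof If conditional
          && alwaysInitialised(variable, conditional.thenBranch())
          && alwaysInitialised(variable, conditional.elseBranch())) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<Stmt> body = List.of(new If(List.of(new Assign("rtn")), List.of()));
    System.out.println(alwaysInitialised("rtn", body));
  }
}
```

In the `main` example the variable is only assigned inside the `then` branch, so there is a path on which it is never initialised and the check fails.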
    

These checks make the EK9 language quite opinionated in terms of what ‘good code’ and ‘bad code’ is. From experience I’ve found that many of the longer term bugs and defects have been caused by issues relating to these checks. Hence, I’ve added these checks in to stop **me** writing code that could cause errors.

### Phase Six

This phase resolves external libraries and built-in code for EK9 code that is marked as **extern**.

This covers the resolution/linking of the built-in types that come as part of EK9 and, in the future, of other platform specific libraries/modules.

The EK9 compiler comes with lots of predefined and built-in types and functions. But in reality they are only defined in terms of being an external interface - in other words they have no concrete implementation.

While this may seem strange, EK9 is designed to be able to have multiple ‘back-ends’ to produce different types of executable code (see later phases). For example, it could be that the EK9 compiler (while written initially in Java) could produce outputs of:

* Java byte code for the EK9 applications developed
    
* LLVM code and then several final binary outputs
    
* Direct platform specific binary outputs (or even cross platform support)
    

### Phase Seven

This phase focusses on the creation of an ‘Intermediate Representation’. This layer is the full abstraction away from the EK9 language and is much more general in nature. It does away with specific checks (as earlier compiler phases have ensured that the structures and semantics are coherent).

The ‘Intermediate Representation’ design is really important; it has to remove most (if not all) of the EK9 language specifics and move towards something much more general. It must also avoid becoming too ‘target architecture’ focussed. It must, however, reify specific information from the EK9 program (such as **type** information). This will be quite a balancing act to get right.

There is little difference at the IR level between a Component, Class, Trait, Text or Record - they are all just really an **aggregate**. Strangely you can also consider a Function or a Dynamic Function to be just an **aggregate** with one method. This works very well for the dynamic functions as they can actually **capture** data as properties (much like an aggregate with no accessor methods for those properties).

Even operators now just become methods on those **aggregates**. This whole approach enables the IR phase to just create various ‘flavours’ of aggregate; it will annotate/reify them with sufficient detail that they can be identified during code generation and at runtime in a very specific way.

But this approach is an essential one to enable the ‘dispatcher’ and ‘function delegate’ approach to work as now everything is just an aggregate (i.e. it is an EK9 ‘Any’). So even instances of functions become ‘objects’ and can be passed around. But the dispatcher code will need to be able to access the ‘Any’ and find out its real type so that appropriate dispatcher methods can be called.

For example, within a module there will be **constants**; these will just be ‘instances’ of an ‘aggregate’ of a specific type (i.e. Float, Date, etc.). But so will named functions; they too will just be ‘instances’ of a ‘function’. A named function can therefore be considered just a **constant**, but of a specific ‘function’.
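
One possible way to picture 'everything is an aggregate' is the Java sketch below (Java 17 or later assumed). Since the IR design is still open, this is purely speculative: the `Aggregate` record, the flavour strings and the single method name `call` are all invented for illustration.

```java
import java.util.List;

// Illustrative only: one general IR shape, annotated with the construct it came from.
final class IrSketch {
  // 'flavour' reifies enough detail to tell the original construct apart later.
  record Aggregate(String name, String flavour, List<String> properties, List<String> methods) {}

  static Aggregate fromRecord(String name, List<String> properties, List<String> operators) {
    // Operators simply become methods on the aggregate.
    return new Aggregate(name, "RECORD", properties, operators);
  }

  static Aggregate fromDynamicFunction(String name, List<String> capturedVariables) {
    // Captured variables become properties; the function body is the single method.
    return new Aggregate(name, "DYNAMIC_FUNCTION", capturedVariables, List.of("call"));
  }

  public static void main(String[] args) {
    System.out.println(fromRecord("Point", List.of("x", "y"), List.of("==", "<")));
    System.out.println(fromDynamicFunction("discounter", List.of("rate")));
  }
}
```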

### Phase Eight

All of the EK9 developer created generic/template types and their concrete (parameterised) forms can now be included in the ‘Intermediate Representation’. This may take several forms (I am unsure which approach to take at present). But it is quite possible that some form of simple aggregate (without the implementation) is created, and that it delegates to a real concrete implementation (in a very general and non-type-safe way), casting to and from real concrete types.

The alternative would be to create real implementations from the generic type/template in a ‘cookie cutter’ fashion. This is likely to create significantly more code. This approach probably won’t be taken.
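
The delegation approach is similar in spirit to type erasure. The Java sketch below is one way it could look, written by hand rather than generated: a single untyped implementation is shared, and each parameterised form is a thin typed wrapper that casts on the way out. The class names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: one shared untyped implementation plus a thin typed wrapper.
final class ErasedGenerics {
  // The single general implementation: everything is held as Object.
  static final class UntypedList {
    private final List<Object> items = new ArrayList<>();

    void add(Object item) {
      items.add(item);
    }

    Object get(int index) {
      return items.get(index);
    }
  }

  // The concrete form for 'List of String': it only delegates and casts.
  static final class ListOfString {
    private final UntypedList delegate = new UntypedList();

    void add(String item) {
      delegate.add(item);
    }

    String get(int index) {
      // Safe because only this wrapper can add to the delegate.
      return (String) delegate.get(index);
    }
  }

  public static void main(String[] args) {
    ListOfString names = new ListOfString();
    names.add("ek9");
    System.out.println(names.get(0));
  }
}
```

The ‘cookie cutter’ alternative would instead produce a complete copy of the list implementation for every type it is parameterised with.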

### Phase Nine

This phase is designed to be a placeholder for Intermediate Representation Analysis and Optimisation. Initially this will not be implemented. It can be quite complex and time consuming to implement, and really I’d like to move ahead with code generation (so I can finally see something coded in EK9 actually run).

### Phase Ten

Code (Byte code, LLVM, binary) will be generated using the ‘Intermediate Representation’ and associated **Symbol** data.

At this point my view on how this will be efficiently implemented is quite sketchy (TBH). In particular, it is not clear how much caching can be done to avoid regenerating code sections that are unchanged.

### Phase Eleven

This is just a placeholder for optimisation of generated code. For some architectures this will be essential, but for others (like JVM based architectures) most of the optimisation will be done at runtime.

### Phase Twelve

Finally, the last phase is just packaging. This could be into an executable, either for the platform this compilation is running on or maybe for another platform if ‘cross-compilation’ is required. For Java byte code generation this phase would probably produce a ‘jar’ file.

## Summary

While the first few phases above are quite detailed, you can see that the latter few are much more general. This is because most of the first phases (up to phase 5) have been implemented (I’m sure there will be bits missing that come to light as the later phases are developed).

The descriptions of phases 6-8 are quite detailed (as it has now become more obvious to me what will be needed). I’m still unsure how to implement the **resolution/linking** of types that have been defined via ‘extern’ interfaces. Clearly this does depend on the ‘target architecture’. So for Java, for example, I may just load the ‘EK9-Lang.jar’ (that I’ll need to code up). Then, using introspection, check that a type such as **org.ek9.lang.String**, as defined in the compiler, does have all the correct methods and signatures as implemented in the Java target architecture and packaged in ‘EK9-Lang.jar’.

This would then enable the compiler to generate the correct ASM (Java byte code) to make a call from the developer’s EK9 code using **org.ek9.lang.String** to the Java implementation.
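
For the Java target, this kind of check could be done with the standard reflection API. The sketch below is only an illustration of the idea: since ‘EK9-Lang.jar’ does not exist yet, it uses `java.lang.String` as a stand-in for the class being checked, and the `ExternSignatureCheck` name is hypothetical.

```java
import java.lang.reflect.Method;

// Illustrative only: checks one expected method signature against a loaded class.
final class ExternSignatureCheck {
  // True if 'type' has a public method with this name, return type and parameter types.
  static boolean hasMethod(Class<?> type, String name, Class<?> returns, Class<?>... params) {
    try {
      Method method = type.getMethod(name, params);
      return method.getReturnType().equals(returns);
    } catch (NoSuchMethodException e) {
      // No public method with that name and those parameter types.
      return false;
    }
  }

  public static void main(String[] args) {
    // java.lang.String stands in for a class that would be loaded from the library jar.
    System.out.println(hasMethod(String.class, "length", int.class));
    System.out.println(hasMethod(String.class, "length", long.class));
  }
}
```

The compiler would run one such check for each method declared on the ‘extern’ definition of the type, and emit an error for any that the library does not actually provide.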

For LLVM solutions, there would need to be some similar mechanism. Clearly this has been done for linking Python code to binary shared libraries. One mechanism to enable early resolution of symbols (before runtime) is to ensure that a binary shared library has a ‘known’ entry point.

Enabling the EK9 compiler to call this ‘known’ entry point in the shared library would allow the shared library to respond with some form of data structure that states what functions, constants, types, etc. exist within it. So for the ‘EK9-Lang.so’, let’s say compiled for Linux, it might respond with just a plain (but long) **String** that has basically the same **types** and structures as the built-in EK9 language interface as defined by the compiler. Indeed, maybe the current built-in types hard coded in the compiler would be removed and just the ‘EK9-Lang.jar’ used via the well known call. It would provide the ‘interface’.

The compiler can already check and resolve methods against the built-in ‘extern’ interface definition. In the same way, at resolution/linking time it can resolve the same types and calls against the interface supplied by the ‘EK9-Lang.so’. This does of course depend on the ‘EK9-Lang.jar’ or ‘EK9-Lang.so’ correctly reporting what it has. If this is incorrect then errors will occur either at linking time (for binaries) or at runtime (for Java ‘jar’ combinations).

This latter approach does seem more scalable and would require less code in the compiler:

* Locate library depending on target architecture
    
* Make a call (using that target architecture) from the compiler to the ‘known’ entry point.
    
* Use the ‘String’ response (containing the EK9 **extern** interface definition) in the compiler and parse it.
    
* In effect this **extern** interface definition is what the EK9 developer’s code must use
    
* Assuming that the developer of the EK9-\*.jar or EK9-\*.so (or whatever was provided) did a good job, and all the constructs outlined in the ‘EK9 **extern** interface definition’ were correctly implemented in the EK9-\*.jar or EK9-\*.so, then at linking/runtime the calls would be resolved and work.
    

The above approach would enable 3rd parties or EK9 developers to wrap existing binary or other code in this sort of interface and make it available for use in EK9 code.
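
The ‘known’ entry point idea can be sketched in Java as follows. Everything here is an assumption made for illustration: the entry point is modelled as a plain `Supplier<String>`, and the response format (one declaration per line, such as `type String`) is invented. In practice the response would be an EK9 **extern** interface definition, parsed by the compiler's own parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Illustrative only: the entry point and its response format are invented here.
final class KnownEntryPoint {
  // Calls the library's entry point and extracts the names of the types it declares.
  static List<String> declaredTypes(Supplier<String> entryPoint) {
    List<String> types = new ArrayList<>();
    for (String line : entryPoint.get().split("\n")) {
      String trimmed = line.trim();
      // Assumed format: a type is declared on a line of the form "type Name".
      if (trimmed.startsWith("type ")) {
        types.add(trimmed.substring(5).trim());
      }
    }
    return types;
  }

  public static void main(String[] args) {
    // A stand-in for a library responding through its 'known' entry point.
    Supplier<String> library = () -> "type String\nfunction now\ntype Date";
    System.out.println(declaredTypes(library));
  }
}
```

With this shape, the compiler itself needs no built-in knowledge of what a library contains; it only needs to know how to call the entry point and how to parse the response.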
