Island Grammars

Program-Transformation.Org: The Program Transformation Wiki

An island grammar only precisely defines small portions of the syntax of a language. The rest of the syntax is defined imprecisely, for instance as a list of characters, or a list of tokens.

  • In BuildingDocumentationGenerators, we have described an Island Grammar as consisting of
    1. detailed productions for the language constructs we are specifically interested in
    2. liberal productions catching all remaining constructs; and
    3. a minimal set of general definitions covering the overall structure of a program. -- ArieVanDeursen


From AQuickIntroductionToSDF:

   module Message
      Msg   -> <START>
      X*    -> Msg
      ~[]   -> X {avoid}

   module Mailaddress
      Segment "@" Hostname     -> X
      [A-Za-z][A-Za-z0-9\-\_]* -> Segment
      {Segment "."}+           -> Hostname

Thus, this grammar gives a precise syntax definition only for mail addresses (the islands), and describes the rest of a message as a list of arbitrary characters (the sea).


The purposes of IslandGrammars are the following:

  • Faster parsing: the grammar is smaller, and contains less (local) ambiguities (hopefully).
  • More robust parsing: an island grammar is less dependent of language changes or differences between dialects than a regular, precise one.
  • Ignorant parsing: full knowledge of the syntax of a given language is not required to parse it.


I have some quesions:

  • Does anyone have more examples of IslandGrammars?
  • Is there evidence that parsing with IslandGrammars can indeed be faster than with regular grammars?
  • How about LakeGrammars? or SkeletonGrammars?: grammars where the "top" of the syntax is defined precisely, while some lower non-terminals are defined imprecisely. For instance, a precise Yacc grammar where the semantic actions (C code) are defined imprecisely?

I've occasionally done a skeleton grammar, usually for language conversion. It's just an ordinary grammar, byt the lexer recognises an extra token, usually called stuff which matches any token or other junk that does not occur in the grammar. Very useful for things like converting between control-statements that have end-markers and ones that don't, for everting C declarations, etc. It gets most of the boring bulk editing done, leaving you to do the changes that need actual brainwork -- HendrikBoom

ErnstJanVerhoeven?'s MSc thesis contains several examples of IslandGrammars in SDFII, which are mostly related to COBOL analysis for DocGen.

The current distribution of DocGen also includes an SQL island grammar in JavaCC. It only extracts the basic SQL commands (insert, delete, update, select, cursor declarations, ...) but essentially ignores expressions. The implementation in JavaCC (done out of portability reasons) is quite a challenge, as it is an LL1 parser generator. -- ArieVanDeursen

LeonMoonen as a paper on Island Grammars at WCRE 2001, see

CategoryReverseEngineering | Contributions by ArieVanDeursen, JoostVisser, HendrikBoom