\documentstyle{article} \newcommand{\code}[1]{{\tt #1}} \newcommand{\emph}[1]{{\em #1}} \newcommand{\mt}{{\sl Modula-3}} \newcommand{\dc}{$\mathrel{:}$} \newcommand{\bra}{$\langle$} \newcommand{\ket}{$\rangle$} \newcommand{\Es}{${\cal E}_s$} \newcommand{\Ecs}{${\cal E}_{cs}$} \newcommand{\Tn}{${\cal T}_n$} \newcommand{\Ti}{${\cal T}_i$} \newcommand{\Tr}{${\cal T}_r$} \newcommand{\Ts}{${\cal T}_s$} \newcommand{\Tm}{${\cal T}_m$} \newcommand{\Td}{${\cal T}_d$} \newcommand{\Tx}{${\cal T}_x$} \newcommand{\TCO}{${\cal TCO}$} \newcommand{\TC}{${\cal TC}$} \title{Finding Dependencies on the \code{NUMBER} of \mt{}'s \code{CHAR} Type} \author{Mike Spreitzer} \begin{document} \maketitle \section{Introduction} The \mt{} programming language is defined in chapter 2 of the book ``Systems Programming With Modula-3'', edited by Greg Nelson. On page 13, this book specifies the ``predeclared'' ordinal type \code{CHAR} to be \begin{quote} an enumeration containing at least 256 elements, \end{quote} and notes that \begin{quote} the first 256 elements of the type \code{CHAR} represent characters in the ISO-Latin-1 code, which is an extension of ASCII. \end{quote} While this specification does not assert that there are \emph{only} 256 elements in type \code{CHAR}, that is the case in the only extant implementation of \mt{}. If one considers making an implementation in which there are many more characters, say $2^{16}$ or $2^{32}$, the question naturally arises of how much code will have to be changed, and how can that code be reliably found? This paper describes programs that can be used to examine a corpus of code and estimate where changes would have to be made. The \mt{} grammar is organized into five areas: types, statements, declarations, the module system, and expressions. The analysis roughly follows this division. We start by identifying ways that expressions and types can be affected by changing the \code{NUMBER} of \code{CHAR}\@. We then identify statements, declarations, and expressions that may be affected, and also some uses of the items declared in potentially-affected declarations. Code fragments concerning the module system are not affected. \section{Kinds of expressions and types, and closure operations} In describing the kinds of code fragment that are identified, we make repeated reference to a few kinds of expressions and types. This section makes those preliminary definitions; the next section then explains the identification of code fragments in these terms. We identify 2 kinds of dependent [on the number of \code{CHAR}]\footnote{Henceforth we write simply ``dependent'' or ``depends''.} expression, 6 kinds of dependent type, and two closure operations on kinds of types. Each of these is denoted by a short symbol, and is defined in terms of others. These definitions are summarized in Table \ref{table1}, and appear in the following subsections. \begin{table} \begin{center}\begin{tabular}{|c|l|} \hline \Es & constant expression that introduces dependency \\ \Ecs & constant expression that depends \\ \Tn & type whose \code{FIRST}, \code{LAST}, or \code{NUMBER} depends \\ \Ti & type that will become impractical \\ \Tr & type that will be replaced \\ \Ts & type whose size introduces dependency \\ \TC(\Tx) & maps kind \Tx{} to kind of types that contain a \Tx{} \\ \TCO(\Tx) & maps kind \Tx{} to kind of object types that contain a \Tx{} \\ \Tm & type that depends in any way \\ \Td & type distantly related to \code{CHAR} \\ \hline \end{tabular}\end{center} \caption{Auxiliary code fragment kinds and closure operators} \label{table1} \end{table} Note that while the classification of fragments of code depends, in some places and in some ways, on the meanings of certain subfragments in their contexts, we are discussing the identification of fragments of code, not the semantic objects that are manipulated by the [meaning of] the program. Program semantics intrude, among other ways, in the necessity of finding the definitions of identifiers. The rules below omit this detail, assuming implicitly that identifiers are replaced with their definitions as necessary.\footnote{Here's a more precise formulation. Wherever a pattern has a (proper) descendant type or constant expression node when parsed as the appropriate non-terminal of the \mt{} grammar, we understand the descendant node to match not only a type or constant expression of the form parsed into that node, but also an identifier ultimately (as far as the reference can see according to \mt{} semantics --- i.e., not through un-revealed (partially) opaque types) equated to a type or constant expression of the form parsed into the node. For example, the pattern \code{NEW(REF \TC(\Tr))} matches not only \code{NEW(REF SET OF CHAR)}, but also \code{NEW(T3)} if elsewhere the program defines \code{TYPE T3 = REF T2; T2 = SET OF T1; T1 = CHAR}.} We also take the liberty of imagining that partial revelations \code{TYPE T $\mathrel{<:}$ U} have been skolemized to \code{TYPE T = $\Phi$(U)}. \subsection{\Es, constant expression that introduces dependency} There are two kinds of expression that introduce a dependency: \begin{itemize} \item \code{BITSIZE}, \code{BYTESIZE}, or \code{ADRSIZE} applied to a type that ``contains'' a type whose size introduces a dependency. \item \code{FIRST}, \code{LAST}, or \code{NUMBER} applied to a type, or a value of a type, whose number depends on the size of \code{CHAR}. \end{itemize} We use a stylized language to describe such kinds of fragment. \Es{} is described thusly: \begin{itemize} \item \code{BITSIZE|BYTESIZE|ADRSIZE (\TC(\Ts))} \item \code{FIRST|LAST|NUMBER (\Tn)} \item \code{FIRST|LAST|NUMBER (\bra a \Tn\ket)} \end{itemize} Each of these patterns describes a code fragment that would be parsed into an ``Expr'' node according to the \mt{} grammar in section 2.8 of ``Systems Programming With Modula-3''. The pattern language is very similar to the metalanguage used in that section to define the \mt{} grammar. In our language, the vertical bar (`\code{|}', used to signify alternation) has the greatest binding power, and terminals are not quoted. Our names for kinds of fragments, and closure operations on them, are used like non-terminals in the grammar. The notation `\code{\bra a \Tx\ket}' is used to stand for any fragment whose meaning is a value whose type specification is of kind \Tx; this is another way that program semantics intrude. In later patterns, the notation `\code{\bra a \Tx{} var\ket}' will be used; it matches any designator whose type specification is of kind \Tx. Ellipses (`\code{\ldots}') will also be used; an ellipsis matches any fragment appropriate for the context of the ellipsis. The first pattern for \Es{} is exact; the other two are conservative, due to the vagueness of \Tn. \subsection{\Ecs, constant expression that depends} Any constant expression constructed according to the rules of section 2.6 of ``Systems Programming with Modula-3'' from an \Es{} expression and other fragments is \Ecs. Note that this wording does not admit a constant expression constructed (using one of the rules of section 2.6.13) from a type that is in turn constructed from an \Es,\footnote{Such an expression is not admitted because of the dependency carried by the type; such an expression could be admitted because of a dependency carried by a sub-expression --- for example, \code{VAL(ORD(LAST(CHAR)), [FIRST(CHAR)..LAST(CHAR)])}.} because the construction of types is not in section 2.6; dependencies carried through types are caught by the rules for \Es{} when they re-appear in the expression part of the language. \subsection{\Tn, type whose \code{FIRST}, \code{LAST}, or \code{NUMBER} depends} \begin{itemize} \item \code{CHAR} \item \code{[\ldots{} .. \Ecs{}]} \item \code{[\Ecs{} .. \ldots]} \item \code{ARRAY \Tn{} OF \ldots} \end{itemize} That is, the \code{CHAR} type itself, or any subrange whose upper or lower bound depends, or an array indexed by another \Tn. The collapsing into one kind of types whose \code{FIRST}, \code{LAST}, and \code{NUMBER} depend leads to conservatism in patterns that use this kind. For example, because \code{FIRST(CHAR)} is \Tn, so is \code{[0 .. FIRST(CHAR)]}, and thus \code{SET OF [0 .. FIRST(CHAR)]} is considered (by the rules of the next subsection) a type that will become impractical. \subsection{\Ti, type that will become impractical} \begin{itemize} \item \code{ARRAY \Tn{} OF \ldots} \item \code{SET OF \Tn} \end{itemize} This is conservative not only because the kind \Tn{} confuses \code{NUMBER}-dependency with \code{FIRST}-dependency and \code{LAST}-dependency, but also because \Tn{} does not quantify how much its \code{NUMBER} will change. For example, \begin{quote} \code{SET OF [0 .. BITSIZE(['$\backslash$000' .. LAST(CHAR)])]}, \end{quote} while increasing in size, will not actually become impractical. \subsection{\Tr, type that will be replaced} \begin{itemize} \item \Ti \item \code{BITS \Ecs{} FOR \ldots} \item \code{BITS \ldots{} FOR \TC(\Ts)} \end{itemize} The types that will have to be changed are those that will become impractical or statically erroneous; the last two patterns make a conservative identification of the latter case. While \Tr{} is the kind of types that an enlargment of the \code{CHAR} type forces to change, the consequent re-design of code may lead to changes in other types; these programs do not attempt to do enough of that re-design to identify those other types. \subsection{\Ts, type whose size introduces dependency} \begin{itemize} \item \Tn \item \Tr \end{itemize} When the \code{NUMBER} of a type changes, its size in memory generally also changes. Types that are edited will, in general, also change their size in memory. Types that ``contain'' a \Ts will also change their size, as a consequence of that containment; such consequently-changing types are identified by \code{\TC(\Ts)}. \subsection{\TC(\Tx), maps kind \Tx{} to kind of types that contain a \Tx{}} \begin{itemize} \item \Tx \item \code{RECORD \ldots{} f\dc{} \TC(\Tx) \ldots{} END} \item \code{ARRAY \ldots{} OF \TC(\Tx)} \item \code{BITS \ldots{} FOR \TC(\Tx)} \end{itemize} The closure operator \TC{} maps a kind \Tx{} of type fragment to the kind of type fragments whose memory layout linearly includes a \Tx. \subsection{\TCO(\Tx), maps kind \Tx{} to kind of object types that contain a \Tx{}} \begin{itemize} \item \code{\ldots{} OBJECT \ldots{} f\dc{} \TC(\Tx) \ldots{} END} \item \code{\TCO(\Tx) OBJECT \ldots{} END} \item \code{$\Phi$(\TCO(\Tx))} \end{itemize} \subsection{\Tm, type that depends in any way} \Tm{} is almost, but not quite, the type fragments that mention \code{CHAR} in any way. The exceptions are due to ``default values'' for object and record fields and for procedure arguments: mentions of \code{CHAR} in those expressions do not qualify a type fragment for inclusion in \Tm. Every other way that a type can mention \code{CHAR} leads to a dependency of the represention of the constructed type, or of types ``reachable'' from it, on \code{NUMBER(CHAR)}\@. Thus, \code{PROCEDURE (c\dc{} CHAR)} is in \Tm{}, but \code{RECORD i\dc{} INTEGER $\mathrel{:=}$ ORD(LAST(CHAR)) END} is not. \subsection{\Td, type distantly related to \code{CHAR}} \begin{itemize} \item \Tm{} $\setminus$ \TC(\Ts) \end{itemize} The distantly-related types are those whose memory layout does not directly contain a field dependent on \code{NUMBER(CHAR)}, but does in some way depend. \section{Code fragments identified} \subsection{Types to change} \begin{itemize} \item \Tr \end{itemize} This kind of fragment has already been discussed. Such fragments are flagged with the message ``{\tt type needs changing}''. \subsection{Statements to think about} We include declarations, of fields and of formal arguments as well as of variables, in this kind. These patterns are all conservative. \begin{enumerate} \item \code{(INC|DEC) (\bra a \Tn{} var\ket, \ldots)} \item assignment to/from \code{\TC(ARRAY \Tn{} OF \ldots)} \item assignment to a \code{\TC(\Ti)} variable \item declaration of an \code{<*EXTERNAL*> \Tm} \item declaration of an object field whose type is in \code{\TC(\Tr)} \end{enumerate} These patterns are flagged with messages of the following forms, respectively (where {\em N} is replaced by the name of the flagged item): \begin{enumerate} \item \tt INC/DEC of a NUM(CHAR)-dependent type \item \tt assignment to/from NUM(CHAR)-dependent array [at formal {\em N}] \item \tt Copy of becoming-huge value [at formal {\em N}] \item \tt Decl of EXTERNAL Tm item {\em N} \item \tt Object field containing changing type \end{enumerate} Pattern 1 is needed because changing \code{LAST(CHAR)} changes the range of possible post-values for the variable, and may change whether such a statement gets a checked runtime error. Pattern 2 is needed because the rules for array assignability depend only on the \code{NUMBER} of the index type, not its full semantics; thus, an assignment between an \code{ARRAY [0..255] OF T2} and an \code{ARRAY CHAR OF T2} could change from valid to invalid. This pattern thus applies everywhere a value is tested for assignability to a type; the list of such places is: \begin{itemize} \item assignment statements, \item actual arguments passed to procedures (including \code{NEW}) or methods, \item the argument in a \code{RAISE} statement with an argument, \item the expression in a \code{RETURN} statement from a function, \item defaults for variables, fields, and formal arguments. \end{itemize} Pattern 3 identifies code fragments that create a mutable copy of a value of a hypothetically-impractical type; if such a type is changed to a reference to a complicated data structure, such code no longer makes a new mutable copy of the value (e.g., the set of characters) of interest to the program --- it only copies the reference to the representation of that value. This pattern thus applies everywhere a value is stored into a mutable variable; the list of such places (and the place flagged, if non-obvious) is: \begin{itemize} \item assignment statements, \item the initialization of the variable to which a \code{VALUE} formal is bound during the execution of a procedure call (the formal is flagged at its declaration and at each call site), \item the initialization of fields of a \code{REF RECORD} or \code{OBJECT} created in a \code{NEW} call (the initial value expressions in field declarations are flagged, as are the bindings supplied in the \code{NEW} call), \item the initialization of $v_i$ in an arm of a \code{TRY-EXCEPT} statement, \item the initialization of a variable given an initial value. \end{itemize} Pattern 4 is needed because declarations of external items with \Tm{} types expose \mt{}'s representation of \code{CHAR} to code written in other languages. Pattern 5 is needed because object types can be hidden: rule 4 of the next subsection will not catch cases where the fact that a type ``really is'' \code{\TCO(\Tr)} is due to a revelation not in scope. \subsection{Expressions with straightforward replacements} If \code{NUMBER(CHAR)} is greatly increased, a few standard data structures (some generic) will probably be provided as replacements for \Ti{} types. This will require transforming all code that creates and manipulates values in these types. Such transformations will probably be rather regular and straightforward (one might even imagine writing a program to do them). The replacements for the other \Tr{} types will probably also require rather regular and straightforward transformations of code that creates and manipulates their values. This subsection syntactically identifies most of the code fragments that create or manipulate values of \Tr{} types. \begin{enumerate} \item \bra a \Tr \ket [ \ldots ] \item \Tr \{ \ldots \} \item \code{NEW(REF \TC(\Tr), \ldots)} \item \code{NEW(\TCO(\Tr), \ldots)} \item \code{\bra a \Tr \ket{} binop \bra a \Tr \ket} \item \code{\ldots{} IN \bra a \Tr \ket} \end{enumerate} Fragments matching these patterns are flagged with messages of the following forms, respectively: \begin{enumerate} \item \tt indexing into changing array \item \tt cons of val in changing type \item \tt NEW of container of changing type \item \tt NEW of container of changing type \item \tt binary operation on val in changing type \item \tt binary operation on val in changing type \end{enumerate} \subsection{Expressions that require harder thought} \begin{enumerate} \item \Es not in a type construction \item use of an external \Tm \item \code{SUBARRAY(\bra a \Tr\ket, \ldots)} \item \code{ORD(\bra a \Tn{} var\ket)} \item \code{VAL(\ldots, \Tn)} \item \code{ADR(\bra a \TC(\Ts)\ket)} \item \code{LOOPHOLE(\bra a \TC(\Ts)\ket, \ldots)} \item \code{LOOPHOLE(\ldots, \TC(\Ts))} \end{enumerate} Fragments matching these patterns are flagged with messages of the following forms, respectively: \begin{enumerate} \item \tt Expr depends on NUM(CHAR) \item \tt Use of EXTERNAL Tm item {\em N} \item \tt SUBARRAY of a changing array \item \tt ORD(var in NUM(CHAR)-dependent type) \item \tt VAL(..., NUM(CHAR)-dependent type) \item \tt ADR of a CHAR-size-dependent type \item \tt LOOPHOLE from a CHAR-size-dependent type \item \tt LOOPHOLE to a CHAR-size-dependent type \end{enumerate} The first pattern says that an \Es{} that contributes to making a type \Tn{} or \Tr{} does not need to be flagged because the consequences of the type being \Tn{} or \Tr{} are flagged, but other expressions that are \Es{} should be flagged because these programs know nothing about the consequences of that expression being \Es. The second pattern is just an aid to tracking down the consequences of a declaration of an external \Tm{} item, which is flagged according to an earlier subsection. The third pattern is in this subsection, rather than the previous one, because there are data structures that can represent a mapping from \code{CHAR} to \code{X} but do not readily support an operation analogous to \code{SUBARRAY}\@. Pattern 4 is flagged because the range of possible results depends --- but only when applied to variables, because constants won't change their ordinal value. Pattern 5 is flagged because the range of valid values for the first argument depends (even when the first argument is constant). The last three patterns identify fragments that reveal \mt{}'s representation of \code{CHAR}. \subsection{Expressions distantly related to \code{CHAR}} These patterns identify code fragments that reveal \mt{}'s representation of a type related to, but not containing, a \code{CHAR}\@. These are definitely conservative. \begin{enumerate} \item \code{ADR(\bra a \Td \ket)} \item \code{LOOPHOLE(\bra a \Td \ket, \ldots)} \item \code{LOOPHOLE(\ldots, \Td)} \end{enumerate} Fragments matching these patterns are flagged with messages of the following forms, respectively: \begin{enumerate} \item \tt ADR of a type related to CHAR \item \tt LOOPHOLE from a type related to CHAR \item \tt LOOPHOLE to a type related to CHAR \end{enumerate} \end{document}