How do I go about building a compiler, interpreter, or scripting engine?

Where would be a good place to start?
Which books should I read?
Which source code is worth studying?

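People tell me to look at the source of things like lex and yacc and try from there...
but even when I actually read that source, I can't really follow it.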

What I actually want to do is study the PHP source,
build something similar in my own way, and use it.

For a start, I think you should buy this book and carry it with you everywhere so you can keep reading it.

Study it hard for a year. With luck you will then be able to read it properly and write properly.


If you want to take an academic approach, I recommend books. There are only a handful of them, so they are easy to track down.

If you want to take an implementation-first approach, search Google for "Compiler Tool" and you will find related material.

As the figure shows, a compiler divides broadly into a front end and a back end.
( http://en.wikipedia.org/wiki/Image:Compiler.svg )

Read a few books, then look at the source of the tools that correspond to each stage; that should help.

PS. Some compiler tools will generate a working compiler for a limited set of platforms (x86, VM, ...) once you define just the grammar (BNF, lex & yacc).

Front end
The front end analyzes the source code to build an internal representation of the program, called the intermediate representation or IR. It also manages the symbol table, a data structure mapping each symbol in the source code to associated information such as location, type and scope. This is done over several phases, which include some of the following:
   1. Line reconstruction. Languages which strop their keywords or allow arbitrary spaces within identifiers require a phase before parsing, which converts the input character sequence to a canonical form ready for the parser. The top-down, recursive-descent, table-driven parsers used in the 1960s typically read the source one character at a time and did not require a separate tokenizing phase. Atlas Autocode, and Imp (and some implementations of Algol and Coral66) are examples of stropped languages whose compilers would have a Line Reconstruction phase.
   2. Lexical analysis breaks the source code text into small pieces called tokens. Each token is a single atomic unit of the language, for instance a keyword, identifier or symbol name. The token syntax is typically a regular language, so a finite state automaton constructed from a regular expression can be used to recognize it. This phase is also called lexing or scanning, and the software doing lexical analysis is called a lexical analyzer or scanner.
   3. Preprocessing. Some languages, e.g., C, require a preprocessing phase which supports macro substitution and conditional compilation. Typically the preprocessing phase occurs before syntactic or semantic analysis; e.g. in the case of C, the preprocessor manipulates lexical tokens rather than syntactic forms. However, some languages such as Scheme support macro substitutions based on syntactic forms.
   4. Syntax analysis involves parsing the token sequence to identify the syntactic structure of the program. This phase typically builds a parse tree, which replaces the linear sequence of tokens with a tree structure built according to the rules of a formal grammar which define the language's syntax. The parse tree is often analyzed, augmented, and transformed by later phases in the compiler.
   5. Semantic analysis is the phase in which the compiler adds semantic information to the parse tree and builds the symbol table. This phase performs semantic checks such as type checking (checking for type errors), or object binding (associating variable and function references with their definitions), or definite assignment (requiring all local variables to be initialized before use), rejecting incorrect programs or issuing warnings. Semantic analysis usually requires a complete parse tree, meaning that this phase logically follows the parsing phase, and logically precedes the code generation phase, though it is often possible to fold multiple phases into one pass over the code in a compiler implementation.
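
To make the front-end phases above a little more concrete, here is a minimal sketch in Python. It is not how PHP, lex, or yacc do it; the tiny arithmetic grammar, the token names, and the helper names (`tokenize`, `Parser`, `check_names`) are all made up for illustration. It shows lexical analysis with regular expressions, syntax analysis by hand-written recursive descent, and a very reduced semantic check against a symbol table.

```python
import re

# Hypothetical token grammar for a tiny arithmetic language (an assumption,
# not taken from PHP or any real tool).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/()]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Lexical analysis: turn a character string into (kind, value) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":  # whitespace is recognized but dropped
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


class Parser:
    """Syntax analysis: recursive descent over the grammar
         expr   := term (('+' | '-') term)*
         term   := factor (('*' | '/') factor)*
         factor := NUMBER | IDENT | '(' expr ')'
    The tree is built from plain tuples such as ('+', left, right).
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("EOF", "")

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse(self):
        tree = self.expr()
        if self.peek()[0] != "EOF":
            raise SyntaxError(f"unexpected token {self.peek()[1]!r}")
        return tree

    def expr(self):
        node = self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            op = self.advance()[1]
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in (("OP", "*"), ("OP", "/")):
            op = self.advance()[1]
            node = (op, node, self.factor())
        return node

    def factor(self):
        kind, value = self.advance()
        if kind == "NUMBER":
            return ("num", int(value))
        if kind == "IDENT":
            return ("var", value)
        if (kind, value) == ("OP", "("):
            node = self.expr()
            if self.advance() != ("OP", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {value!r}")


def check_names(tree, symbols):
    """Semantic analysis (very reduced): every variable must be a known symbol."""
    if tree[0] == "var":
        if tree[1] not in symbols:
            raise NameError(f"undefined variable {tree[1]!r}")
    elif tree[0] != "num":
        check_names(tree[1], symbols)
        check_names(tree[2], symbols)
```

For example, `Parser(tokenize("1 + 2 * x")).parse()` returns `('+', ('num', 1), ('*', ('num', 2), ('var', 'x')))`, and `check_names` on that tree raises `NameError` unless `x` is in the symbol set. Tools like lex and yacc generate the equivalent of `tokenize` and `Parser` from a grammar description; writing them by hand once is one way to make the generated code easier to read.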
Back end
The term back end is sometimes confused with code generator because of the overlapped functionality of generating assembly code. Some literature uses middle end to distinguish the generic analysis and optimization phases in the back end from the machine-dependent code generators.
The main phases of the back end include the following:
   1. Analysis: This is the gathering of program information from the intermediate representation derived from the input. Typical analyses are data flow analysis to build use-define chains, dependence analysis, alias analysis, pointer analysis, escape analysis etc. Accurate analysis is the basis for any compiler optimization. The call graph and control flow graph are usually also built during the analysis phase.
   2. Optimization: the intermediate language representation is transformed into functionally equivalent but faster (or smaller) forms. Popular optimizations are inline expansion, dead code elimination, constant propagation, loop transformation, register allocation or even automatic parallelization.
   3. Code generation: the transformed intermediate language is translated into the output language, usually the native machine language of the system. This involves resource and storage decisions, such as deciding which variables to fit into registers and memory and the selection and scheduling of appropriate machine instructions along with their associated addressing modes (see also Sethi-Ullman algorithm).
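
The back-end phases can be sketched in the same spirit. The following Python example is only an illustration under stated assumptions: the tuple-shaped tree (`('num', n)`, `('var', name)`, `(op, left, right)`), the three-instruction stack machine (`PUSH`, `LOAD`, `APPLY`), and the function names are all invented here, and a real compiler would target native machine code or a real VM instead. It shows one optimization (constant folding) and a simple code generator, plus a toy interpreter for the generated instructions so the result can be checked.

```python
import operator

# Integer arithmetic only; "/" is floor division in this toy language (an assumption).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.floordiv}


def fold_constants(tree):
    """Optimization: replace an operation on two constants with its result."""
    if tree[0] in ("num", "var"):
        return tree
    op = tree[0]
    left = fold_constants(tree[1])
    right = fold_constants(tree[2])
    # Leave division by a constant zero alone so the error surfaces at run time.
    if left[0] == "num" and right[0] == "num" and not (op == "/" and right[1] == 0):
        return ("num", OPS[op](left[1], right[1]))
    return (op, left, right)


def generate(tree, code=None):
    """Code generation: post-order walk emitting stack-machine instructions."""
    if code is None:
        code = []
    if tree[0] == "num":
        code.append(("PUSH", tree[1]))
    elif tree[0] == "var":
        code.append(("LOAD", tree[1]))
    else:
        generate(tree[1], code)
        generate(tree[2], code)
        code.append(("APPLY", tree[0]))
    return code


def run(code, env):
    """A toy virtual machine standing in for real hardware."""
    stack = []
    for instr, arg in code:
        if instr == "PUSH":
            stack.append(arg)
        elif instr == "LOAD":
            stack.append(env[arg])
        else:  # APPLY: pop two operands, push the result
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[arg](left, right))
    return stack.pop()
```

Given the tree `("+", ("var", "x"), ("*", ("num", 2), ("num", 3)))`, `fold_constants` yields `("+", ("var", "x"), ("num", 6))`, `generate` turns that into `[("LOAD", "x"), ("PUSH", 6), ("APPLY", "+")]`, and `run(..., {"x": 4})` returns `10`. Register allocation and instruction scheduling, which the text mentions, do not arise on a stack machine, which is one reason a stack-based VM is a common first target.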

Hello World.

It looks like I will still need to put in some effort before I get a feel for this.

