From 74c0aecca6fcf50cdd1493c0236562a93a449ed7 Mon Sep 17 00:00:00 2001
From: Rui Ueyama
Date: Sun, 30 Aug 2020 18:49:29 +0900
Subject: [PATCH] Self-host: including preprocessor, chibicc can compile itself

---
 Makefile  |   7 ++-
 README.md | 124 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 self.py   | 109 -----------------------------------------------
 3 files changed, 126 insertions(+), 114 deletions(-)
 delete mode 100755 self.py

diff --git a/Makefile b/Makefile
index a9bda5d..89778a3 100644
--- a/Makefile
+++ b/Makefile
@@ -14,7 +14,7 @@ chibicc: $(OBJS)
 $(OBJS): chibicc.h
 
 test/%.exe: chibicc test/%.c
-	./chibicc -Itest -c -o test/$*.o test/$*.c
+	./chibicc -Iinclude -Itest -c -o test/$*.o test/$*.c
 	$(CC) -o $@ test/$*.o -xc test/common
 
 test: $(TESTS)
@@ -28,10 +28,9 @@ test-all: test test-stage2
 stage2/chibicc: $(OBJS:%=stage2/%)
 	$(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
 
-stage2/%.o: chibicc self.py %.c
+stage2/%.o: chibicc %.c
 	mkdir -p stage2/test
-	./self.py chibicc.h $*.c > stage2/$*.c
-	./chibicc -c -o stage2/$*.o stage2/$*.c
+	./chibicc -c -o $(@D)/$*.o $*.c
 
 stage2/test/%.exe: stage2/chibicc test/%.c
 	mkdir -p stage2/test
diff --git a/README.md b/README.md
index 60eb404..a02d7f4 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,123 @@
-This is the reference implementation of https://www.sigbus.info/compilerbook.
+# chibicc: A Teaching C Compiler
+
+chibicc is a C compiler for educational purposes. I wrote it with the
+following goals in mind:
+
+- Simple: The compiler should be as simple as possible to help the
+  reader understand how it works.
+
+- Small: The compiler should be small enough to be covered in a
+  semester.
+
+- Demonstrating an incremental approach: Its git history should start
+  from a minimal compiler implementation, and the compiler should gain
+  one feature at a time with a small incremental patch.
+  That should help the reader understand how to write a large program
+  from scratch, which requires a different kind of skill set than
+  writing a patch for an existing large project.
+
+- Correctness: It should correctly capture the semantics of major
+  but obscure C language features, such as "usual arithmetic
+  conversions" or "arrays decay into pointers".
+
+- Completeness: While the compiler doesn't have to support all C
+  language features, it should be able to compile nontrivial programs
+  including itself.
+
+I believe all the above goals are met. chibicc's source code is in my
+opinion small and pretty easy to read, and not only the current state
+of the code but _every commit_ was written with readability in
+mind. The first commit is a minimalistic compiler that compiles an
+integer to a program that exits with the given number as the exit
+code. Then I added operators (e.g. `+` or `-`), local variables,
+control structures (e.g. `if` or `while`), function calls, function
+definitions, global variables, and other language features one at a
+time. As the compiler gained features with a series of small patches,
+the language the compiler accepts looked more and more like the real C
+language.
+
+When I found a bug in a previous commit, I edited the commit by
+rewriting the git history instead of creating a new commit. This is an
+unusual and undesirable development style for most projects, but for
+my purpose, keeping a clean commit history is more important than
+avoiding git force-pushes.
+
+chibicc's internal design was carefully chosen to naturally support
+the core C language semantics. It supports many C language features
+including the preprocessor. chibicc is written in C and can compile
+itself. I didn't try to avoid certain C features when writing this
+compiler for ease of self-hosting, so I can say that it can compile at
+least one ordinary C program.
+
+That being said, there are many missing features.
+They are left as an exercise for the reader.
+
+## Internals
+
+chibicc consists of the following stages:
+
+- Tokenize: A tokenizer takes a string as input, breaks it into
+  a list of tokens and returns them.
+
+- Preprocess: A preprocessor takes a list of tokens as input and
+  outputs a new list of macro-expanded tokens. It interprets
+  preprocessor directives while expanding macros.
+
+- Parse: A recursive descent parser constructs abstract syntax trees
+  from the output of the preprocessor. It also adds a type to each
+  AST node.
+
+- Codegen: A code generator emits assembly text for given AST nodes.
+
+Currently, there's no optimization pass, but there's a plan to add one
+to eliminate obvious inefficiencies in chibicc's output.
+
+Note that chibicc allocates memory using malloc() but never calls
+free(). Once memory is allocated, it won't be released until the
+process exits. This may look like an odd design choice, and perhaps
+it is, but in practice this memory management policy (or the lack
+thereof) works well for short-lived programs like chibicc. This design
+eliminates all scaffolding and complexity of manual memory management
+and makes the compiler much simpler than it would otherwise have been.
+If the memory consumption becomes a real issue, you can plug in [Boehm
+GC](https://en.wikipedia.org/wiki/Boehm_garbage_collector) for
+automatic memory management.
+
+## Book
+
+I'm writing an online book about the C compiler. The draft is
+available at https://www.sigbus.info/compilerbook, though currently it
+is in Japanese. I have a plan to translate it into English once it's
+complete.
+
+## About the Author
+
+I'm Rui Ueyama. I'm the creator of [8cc](https://github.com/rui314/8cc),
+which is a hobby C compiler, and also the original creator of the
+current version of [LLVM lld](https://lld.llvm.org) linker, which is a
+production-quality linker used by various operating systems and
+large-scale build systems.
+
+## References
+
+- [tcc](https://bellard.org/tcc/): A small C compiler written by
+  Fabrice Bellard. I learned a lot from this compiler, but the designs
+  of tcc and chibicc are largely different. In particular, tcc is a
+  one-pass compiler, while chibicc is a multi-pass one.
+
+- [lcc](https://github.com/drh/lcc): Another small C compiler. The
+  creators wrote a
+  [book](https://sites.google.com/site/lccretargetablecompiler/) about
+  the internals of lcc, which I found a good resource to see how a
+  compiler is implemented.
+
+- [An Incremental Approach to Compiler
+  Construction](http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf)
+
+## Project ideas
+
+- Add missing features
+- Port to different ISAs such as RISC-V
+- Rewrite the compiler in a different language than C
+- Use LLVM as a backend
+- Add optimization passes
diff --git a/self.py b/self.py
deleted file mode 100755
index 9199857..0000000
--- a/self.py
+++ /dev/null
@@ -1,109 +0,0 @@
-#!/usr/bin/python3
-import re
-import sys
-
-print("""
-typedef signed char int8_t;
-typedef short int16_t;
-typedef int int32_t;
-typedef long int64_t;
-
-typedef unsigned char uint8_t;
-typedef unsigned short uint16_t;
-typedef unsigned int uint32_t;
-typedef unsigned long uint64_t;
-
-typedef unsigned long size_t;
-
-typedef struct FILE FILE;
-extern FILE *stdin;
-extern FILE *stdout;
-extern FILE *stderr;
-
-typedef struct {
-  int gp_offset;
-  int fp_offset;
-  void *overflow_arg_area;
-  void *reg_save_area;
-} __va_elem;
-
-typedef __va_elem va_list[1];
-
-struct stat {
-  char _[512];
-};
-
-void *malloc(long size);
-void *calloc(long nmemb, long size);
-void *realloc(void *buf, long size);
-int *__errno_location();
-char *strerror(int errnum);
-FILE *fopen(char *pathname, char *mode);
-long fread(void *ptr, long size, long nmemb, FILE *stream);
-int fclose(FILE *fp);
-int feof(FILE *stream);
-static void assert() {}
-int strcmp(char *s1, char *s2);
-int strncasecmp(char *s1, char *s2);
-int printf(char *fmt, ...);
-int sprintf(char *buf, char *fmt, ...);
-int fprintf(FILE *fp, char *fmt, ...);
-int vfprintf(FILE *fp, char *fmt, va_list ap);
-long strlen(char *p);
-int strncmp(char *p, char *q);
-void *memcpy(char *dst, char *src, long n);
-char *strdup(char *p);
-char *strndup(char *p, long n);
-char *strdup(char *p);
-int isspace(int c);
-int ispunct(int c);
-int isdigit(int c);
-int isxdigit(int c);
-char *strstr(char *haystack, char *needle);
-char *strchr(char *s, int c);
-double strtod(char *nptr, char **endptr);
-static void va_end(va_list ap) {}
-long strtoul(char *nptr, char **endptr, int base);
-void exit(int code);
-char *basename(char *path);
-char *strrchr(char *s, int c);
-int unlink(char *pathname);
-int mkstemp(char *template);
-int close(int fd);
-int fork(void);
-int execvp(char *file, char **argv);
-void _exit(int code);
-int wait(int *wstatus);
-int atexit(void (*)(void));
-FILE *open_memstream(char **ptr, size_t *sizeloc);
-char *dirname(char *path);
-char *strncpy(char *dest, char *src, long n);
-int stat(char *pathname, struct stat *statbuf);
-int stat(char *pathname, struct stat *statbuf);
-char *dirname(char *path);
-char *basename(char *path);
-char *strrchr(char *s, int c);
-int unlink(char *pathname);
-int mkstemp(char *template);
-int close(int fd);
-int fork(void);
-int execvp(char *file, char **argv);
-void _exit(int code);
-int wait(int *wstatus);
-int atexit(void (*)(void));
-""")
-
-for path in sys.argv[1:]:
-    with open(path) as file:
-        s = file.read()
-        s = re.sub(r'\\\n', '', s)
-        s = re.sub(r'^\s*#.*', '', s, flags=re.MULTILINE)
-        s = re.sub(r'\bbool\b', '_Bool', s)
-        s = re.sub(r'\berrno\b', '*__errno_location()', s)
-        s = re.sub(r'\btrue\b', '1', s)
-        s = re.sub(r'\bfalse\b', '0', s)
-        s = re.sub(r'\bNULL\b', '0', s)
-        s = re.sub(r'\bva_start\(([^)]*),([^)]*)\)', '*(\\1)=*(__va_elem*)__va_area__', s)
-        s = re.sub(r'\bunreachable\b', 'error', s)
-        s = re.sub(r'\bMIN\(([^)]*),([^)]*)\)', '((\\1)<(\\2)?(\\1):(\\2))', s)
-        print(s)
-- 
GitLab
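
Note on what the deleted self.py was working around: the script stripped every `#` line and then rewrote a handful of constructs by regex (`MIN(...)`, `bool`, `true`, `false`, `NULL`, and a stubbed-out `assert`) before handing the sources to chibicc. Below is a minimal sketch of the same constructs written as ordinary C that relies on a preprocessor and standard headers. The `MIN` definition mirrors the text the script's regex substituted; whether chibicc.h defines it exactly this way is an assumption. `shorter_len` and `is_blank` are hypothetical helpers made up for illustration and do not appear in chibicc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Assumed definition: it matches the expansion self.py produced for MIN(a,b).
#define MIN(x, y) ((x) < (y) ? (x) : (y))

// Hypothetical helper: length of the shorter of two strings.
// self.py replaced assert with an empty stub; here it is the real macro.
static size_t shorter_len(const char *a, const char *b) {
  assert(a != NULL && b != NULL);
  return MIN(strlen(a), strlen(b));
}

// Hypothetical helper: true if s is NULL or holds only spaces and tabs.
// self.py rewrote bool, true, false and NULL to _Bool, 1, 0 and 0.
static bool is_blank(const char *s) {
  if (s == NULL)
    return true;
  for (; *s; s++)
    if (*s != ' ' && *s != '\t')
      return false;
  return true;
}
```

With the new stage2 rule (`./chibicc -c -o $(@D)/$*.o $*.c`), sources are passed to chibicc unmodified, so code of this shape no longer needs a rewriting step. The `-Iinclude` flag added to the test rule suggests the headers come from an `include` directory in the tree, though that directory is not part of this patch.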