Translated by AI
Building a Python Compiler #8 - Building the Object Model
Hello. Up to the previous post (#7.5 - Project Name Change and Layout Restructuring (Extra Edition)), we have covered the basics such as "Lexical Analysis -> AST Conversion -> LLVM IR Generation" and optimization through primitive type boxing/unboxing.
This time, I will finally talk about the introduction of a full-fledged "Object-Oriented Model." Specifically, I have started creating new files such as classobject.c, instanceobject.c, methodobject.c, and functionobject.c to reproduce classes, instances, and methods using a mechanism close to CPython.
I will summarize what this "Object Model Construction" aims for and what kind of overall structure it takes in a way that is as easy to understand as possible.
1. What kind of "Object Model" do I want to create?
1-1. Modeling after CPython's Object System
While this project has the theme of "converting Python code to LLVM IR and making it an executable file," the ultimate goal is to maintain affinity with major Python libraries and C extensions in the future.
Therefore, I have started to mimic the "CPython-style object system" quite faithfully as a base. Specifically:
- Type structure centered around PyObject and PyVarObject
  - All objects share ob_refcnt and ob_type.
  - Lists, dictionaries, etc., use PyVarObject to have an element count ob_size.
- PyTypeObject filled with tp_* (destructors and operator tables)
  - It contains members like tp_call, tp_repr, and tp_as_number to define "how this type behaves."
- Reference count management with Py_INCREF / Py_DECREF + Boehm GC
  - While CPython uses reference counting + GC (cycle detection), this project uses Boehm GC in combination to delegate the resolution of circular references.
  - Even so, I still include Py_INCREF, Py_DECREF, etc., with compatibility in mind.
Adopting this design leaves open the possibility of partially implementing "APIs similar to CPython" at the C level, or of leveraging existing Python extension modules in the future.
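As a rough illustration of the layout described above, here is a minimal sketch in C. It is not the project's actual header: the field types, the reduced set of tp_* members, and the decision to let Py_DECREF only decrement (on the assumption that Boehm GC, not the counter, reclaims memory) are all my own simplifications.

```c
#include <stddef.h>

/* Hypothetical, simplified layout; the real project headers may differ. */
typedef struct TypeObject PyTypeObject;

typedef struct Object {
    ptrdiff_t     ob_refcnt;  /* reference count shared by every object */
    PyTypeObject *ob_type;    /* pointer to the type that defines behavior */
} PyObject;

typedef struct {
    PyObject  ob_base;
    ptrdiff_t ob_size;        /* element count for variable-sized objects */
} PyVarObject;

struct TypeObject {
    PyVarObject ob_base;
    const char *tp_name;
    /* Only two of the tp_* slots are shown here. */
    PyObject *(*tp_call)(PyObject *callable, PyObject *args, PyObject *kwargs);
    PyObject *(*tp_repr)(PyObject *self);
};

/* Assumption: a tracing collector reclaims memory, so these macros only
 * keep the counter consistent and never free anything at zero. */
#define Py_INCREF(op) ((void)(((PyObject *)(op))->ob_refcnt++))
#define Py_DECREF(op) ((void)(((PyObject *)(op))->ob_refcnt--))
```

Because every struct starts with the same header, any object pointer can be cast to PyObject * to reach ob_refcnt and ob_type, which is what makes generic macros like these possible.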
1-2. Main Difference from Before
Previously, the stance was: "int is an LLVM i32, str (a Python string) is a PyUnicodeObject*; define only the minimal types in C and keep the rest of the logic simple."
The major change in this update is that I have started to reproduce the Pythonic object layout and dispatch mechanism, including "class" and "def."
As a result, many files have been added to runtime/builtin/objects/, and concepts such as PyClassObject, PyInstanceObject, PyMethodObject, and PyFunctionObject have appeared. By combining these:
- Represent the class body (class MyClass:) with "PyClassObject + cl_dict (attribute dictionary)."
- Generate instances as "PyInstanceObject + in_dict" and handle calls to myinstance.method() via PyMethodObject.
- Hold function objects (def func()) in the form of PyFunctionObject.
...I am aiming for an architecture that is quite close to CPython.
2. Main Newly Added C Files and Structs
2-1. classobject.c/h - Class Definitions
- PyClassObject: A struct to represent a class.
  - cl_name: Class name (e.g., "MyClass").
  - cl_dict: Class dictionary (equivalent to __dict__), which holds methods and class variables.
  - cl_bases: A list of base classes (to support inheritance).
- PyClass_Type: A "class type object" that inherits from PyTypeObject.
  - Assigns class_call to tp_call to implement "class calling (= instantiation)."
- PyClass_New(name, bases, dict): A function that generates a new class.
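To make the relationship between PyClassObject, PyClass_Type, and PyClass_New more concrete, here is a small sketch under stated assumptions. The reduced object header, the ternaryfunc typedef, the placeholder body of class_call, and the use of malloc (the real runtime would presumably allocate through Boehm GC) are all hypothetical.

```c
#include <stdlib.h>

typedef struct Object PyObject;
typedef PyObject *(*ternaryfunc)(PyObject *callable, PyObject *args, PyObject *kwargs);

/* Type object reduced to the two members this sketch needs. */
typedef struct {
    const char *tp_name;
    ternaryfunc tp_call;
} PyTypeObject;

struct Object {
    long          ob_refcnt;
    PyTypeObject *ob_type;
};

typedef struct {
    PyObject  ob_base;
    PyObject *cl_name;   /* class name */
    PyObject *cl_bases;  /* base classes; may be NULL in this sketch */
    PyObject *cl_dict;   /* methods and class variables */
} PyClassObject;

/* Placeholder for instantiation; a real class_call would create an instance. */
static PyObject *class_call(PyObject *cls, PyObject *args, PyObject *kwargs) {
    (void)cls; (void)args; (void)kwargs;
    return NULL;
}

/* Calling any object whose ob_type is PyClass_Type goes through class_call. */
static PyTypeObject PyClass_Type = { "class", class_call };

PyObject *PyClass_New(PyObject *name, PyObject *bases, PyObject *dict) {
    PyClassObject *cls = malloc(sizeof *cls);
    if (cls == NULL) return NULL;
    cls->ob_base.ob_refcnt = 1;
    cls->ob_base.ob_type = &PyClass_Type;
    cls->cl_name = name;
    cls->cl_bases = bases;
    cls->cl_dict = dict;
    return (PyObject *)cls;
}
```

The point of the design is that "calling a class" needs no special case in the caller: it is just the tp_call slot of the class's type.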
2-2. functionobject.c/h - Function Definitions
- PyFunctionObject: A struct for holding a function.
  - func_code: Actual function code (a function pointer for native functions; LLVM IR or bytecode in the future).
  - func_name: Function name.
  - func_defaults: Default arguments (some parts are still unimplemented).
- PyFunction_New(...): Creates a new function object.
- PyFunction_FromNative(...): Used when registering a native C function as a Python function object.
2-3. methodobject.c/h - Method Binding
- PyMethodObject: A bound method struct that bundles an "instance + function."
  - im_func: Function (PyFunctionObject*).
  - im_self: Instance (PyInstanceObject*).
  - Inserts im_self as the first argument upon calling via tp_call.
- PyMethod_New(func, self): Binds a function and an instance to create a new method.
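The "insert im_self as the first argument" step can be sketched as follows. This is only an illustration: passing arguments as a plain C array with a count (instead of a tuple object), the native_fn signature, and the return_self helper are assumptions made to keep the example short.

```c
#include <stdlib.h>
#include <string.h>

typedef struct { long ob_refcnt; } PyObject;  /* header reduced for the sketch */

/* Hypothetical native calling convention: argument array plus count. */
typedef PyObject *(*native_fn)(PyObject **args, int nargs);

typedef struct {
    PyObject    ob_base;
    native_fn   func_code;
    const char *func_name;
} PyFunctionObject;

typedef struct {
    PyObject          ob_base;
    PyFunctionObject *im_func;  /* the function found in the class */
    PyObject         *im_self;  /* the instance it is bound to */
} PyMethodObject;

PyMethodObject *PyMethod_New(PyFunctionObject *func, PyObject *self) {
    PyMethodObject *m = malloc(sizeof *m);
    if (m == NULL) return NULL;
    m->ob_base.ob_refcnt = 1;
    m->im_func = func;
    m->im_self = self;
    return m;
}

/* Build a new argument array with self in slot 0, then call the function. */
PyObject *method_call(PyMethodObject *m, PyObject **args, int nargs) {
    PyObject **full = malloc((size_t)(nargs + 1) * sizeof *full);
    if (full == NULL) return NULL;
    full[0] = m->im_self;
    if (nargs > 0) memcpy(full + 1, args, (size_t)nargs * sizeof *full);
    PyObject *result = m->im_func->func_code(full, nargs + 1);
    free(full);
    return result;
}

/* Example native function: returns its first argument, i.e. self. */
PyObject *return_self(PyObject **args, int nargs) {
    return nargs > 0 ? args[0] : NULL;
}
```

Calling method_call on a method bound to x with return_self as its function hands back x regardless of the explicit arguments, which shows that self really arrives in the first slot.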
2-4. instanceobject.c/h - Instances
- PyInstanceObject:
  - in_class: Which class (PyClassObject*) it was generated from.
  - in_dict: Attribute dictionary specific to this instance.
- PyInstance_NewRaw / PyInstance_New: Functions that actually create instances.
  - PyInstance_NewRaw does not call __init__.
  - PyInstance_New searches for and calls the __init__ method.
With this structure in place, the flow of class -> instance -> method call can be handled entirely at the C level.
3. How Class Definition and Instance Creation Work
3-1. Behind the Scenes of Class Definition (class MyClass:)
- class_dict = PyDict_New()
  - First, an empty dictionary is created to register the class's methods and variables.
- Read Method Definitions (FunctionDef) and Create PyFunctionObject (or Native Function)
  - Register them using PyDict_SetItem(class_dict, method_name, function_object).
- Convert Class Name ("MyClass") to a Python String (PyUnicodeObject)
- Call PyClass_New(class_name, bases, class_dict) to Get a PyClassObject
  - cl_name = "MyClass"
  - cl_dict = class_dict
  - cl_bases = ... (currently an empty list)
- Save the generated class object as a symbol like %MyClass and register it in the symbol table.
In this way, "MyClass" can operate as a Python class object (PyClassObject) at runtime.
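The sequence of runtime calls for a class definition can be sketched in C as below. Everything here is a stand-in: ToyDict replaces the runtime's dictionary, plain C strings replace PyUnicodeObject, and dict_new, dict_set, dict_get, class_new, and define_my_class are hypothetical names that only mirror the roles of PyDict_New, PyDict_SetItem, and PyClass_New.

```c
#include <stdlib.h>
#include <string.h>

#define DICT_CAP 8

/* Toy string-keyed dictionary standing in for the runtime's dict type. */
typedef struct {
    const char *keys[DICT_CAP];
    void       *vals[DICT_CAP];
    int         len;
} ToyDict;

ToyDict *dict_new(void) { return calloc(1, sizeof(ToyDict)); }

int dict_set(ToyDict *d, const char *key, void *val) {
    for (int i = 0; i < d->len; i++)
        if (strcmp(d->keys[i], key) == 0) { d->vals[i] = val; return 0; }
    if (d->len == DICT_CAP) return -1;  /* toy dictionary is full */
    d->keys[d->len] = key;
    d->vals[d->len] = val;
    d->len++;
    return 0;
}

void *dict_get(const ToyDict *d, const char *key) {
    for (int i = 0; i < d->len; i++)
        if (strcmp(d->keys[i], key) == 0) return d->vals[i];
    return NULL;
}

typedef struct { void (*func_code)(void); const char *func_name; } Function;
typedef struct { const char *cl_name; void *cl_bases; ToyDict *cl_dict; } Class;

Class *class_new(const char *name, void *bases, ToyDict *dict) {
    Class *cls = malloc(sizeof *cls);
    if (cls == NULL) return NULL;
    cls->cl_name = name;
    cls->cl_bases = bases;
    cls->cl_dict = dict;
    return cls;
}

/* Roughly what the code generated for "class MyClass:" with one method does. */
Class *define_my_class(Function *method) {
    ToyDict *class_dict = dict_new();                    /* empty class dict */
    if (class_dict == NULL) return NULL;
    if (dict_set(class_dict, method->func_name, method) != 0) {  /* register method */
        free(class_dict);
        return NULL;
    }
    Class *cls = class_new("MyClass", NULL, class_dict); /* name + bases + dict */
    if (cls == NULL) free(class_dict);
    return cls;  /* the caller would store this in the %MyClass slot */
}
```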
3-2. Instance Creation (x = MyClass())
- At the LLVM IR level, %class_obj = load ptr, ptr %MyClass (load the class object).
- Call call ptr @PyObject_Call(ptr %class_obj, <args>, null) or jump directly to class_call.
- Inside class_call, PyInstance_New is called to create a PyInstanceObject.
- If an __init__ method exists (cl_dict["__init__"]), call it.
At this point, a PyInstanceObject is returned and held as %x.
From the user program's perspective, it looks like an instance is created by "x = MyClass()", but in reality, the class's tp_call implementation runs the flow: instance creation -> __init__ call.
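The "create the instance, then run __init__ if the class defines one" flow might look like this in C. This is a sketch under assumptions, not the project's implementation: ToyDict, Function, Class, Instance, and demo_init are hypothetical stand-ins, and __init__ takes no arguments other than self.

```c
#include <stdlib.h>
#include <string.h>

#define DICT_CAP 8

/* Toy string-keyed dictionary standing in for the runtime's dict type. */
typedef struct {
    const char *keys[DICT_CAP];
    void       *vals[DICT_CAP];
    int         len;
} ToyDict;

int dict_set(ToyDict *d, const char *key, void *val) {
    for (int i = 0; i < d->len; i++)
        if (strcmp(d->keys[i], key) == 0) { d->vals[i] = val; return 0; }
    if (d->len == DICT_CAP) return -1;
    d->keys[d->len] = key;
    d->vals[d->len] = val;
    d->len++;
    return 0;
}

void *dict_get(const ToyDict *d, const char *key) {
    for (int i = 0; i < d->len; i++)
        if (strcmp(d->keys[i], key) == 0) return d->vals[i];
    return NULL;
}

typedef struct Instance Instance;

/* Stand-in for PyFunctionObject holding a native function pointer. */
typedef struct { void (*func_code)(Instance *self); } Function;

typedef struct { const char *cl_name; ToyDict cl_dict; } Class;
struct Instance { Class *in_class; ToyDict in_dict; };

/* Like PyInstance_NewRaw: allocate only, do not run __init__. */
Instance *instance_new_raw(Class *cls) {
    Instance *inst = calloc(1, sizeof *inst);
    if (inst != NULL) inst->in_class = cls;
    return inst;
}

/* Like PyInstance_New: allocate, then run __init__ if the class has one. */
Instance *instance_new(Class *cls) {
    Instance *inst = instance_new_raw(cls);
    if (inst == NULL) return NULL;
    Function *init = dict_get(&cls->cl_dict, "__init__");
    if (init != NULL) init->func_code(inst);
    return inst;
}

/* Example __init__: stores a marker under the attribute name "ready". */
int ready_marker = 1;
void demo_init(Instance *self) {
    dict_set(&self->in_dict, "ready", &ready_marker);
}
```

Keeping the raw allocation separate from the __init__ step makes it possible to build an instance without running user code, which is the distinction the two function names in the post suggest.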
3-3. Method Calling (x.method(...))
- When retrieving method via instance_getattro, a PyMethodObject is created.
  - (im_func=cl_dict["method"], im_self=x)
- At the time of calling (method_call), self is inserted at the beginning and executed.
- Essentially, the native function or LLVM IR function inside the PyFunctionObject is called.
This is the mechanism of "bound methods." Since it follows the same flow as CPython, the Python behavior where "methods are stored in the class dictionary and automatically bound upon instance access" is reproduced.
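The lookup-then-bind step can be sketched as follows. The names ToyDict, instance_get_method, and add_to_value are hypothetical, the method takes a single int argument for brevity, and only the class dictionary is consulted (how the real instance_getattro treats in_dict is not shown here).

```c
#include <stddef.h>
#include <string.h>

#define DICT_CAP 8

/* Toy string-keyed dictionary standing in for the runtime's dict type. */
typedef struct {
    const char *keys[DICT_CAP];
    void       *vals[DICT_CAP];
    int         len;
} ToyDict;

int dict_set(ToyDict *d, const char *key, void *val) {
    for (int i = 0; i < d->len; i++)
        if (strcmp(d->keys[i], key) == 0) { d->vals[i] = val; return 0; }
    if (d->len == DICT_CAP) return -1;
    d->keys[d->len] = key;
    d->vals[d->len] = val;
    d->len++;
    return 0;
}

void *dict_get(const ToyDict *d, const char *key) {
    for (int i = 0; i < d->len; i++)
        if (strcmp(d->keys[i], key) == 0) return d->vals[i];
    return NULL;
}

typedef struct Instance Instance;
typedef struct { int (*func_code)(Instance *self, int arg); } Function;
typedef struct { ToyDict cl_dict; } Class;
struct Instance { Class *in_class; int value; };
typedef struct { Function *im_func; Instance *im_self; } Method;

/* Find the function in the class dictionary and pair it with the instance.
 * Returns 0 on success, -1 if the class has no such method. */
int instance_get_method(Instance *inst, const char *name, Method *out) {
    Function *f = dict_get(&inst->in_class->cl_dict, name);
    if (f == NULL) return -1;
    out->im_func = f;
    out->im_self = inst;
    return 0;
}

/* Calling the bound method passes im_self as the first argument. */
int method_call(const Method *m, int arg) {
    return m->im_func->func_code(m->im_self, arg);
}

/* Example method body: reads state from self. */
int add_to_value(Instance *self, int arg) {
    return self->value + arg;
}
```

The function itself is stored once in the class dictionary; the binding to a particular instance only exists in the small Method value produced at attribute-access time.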
4. Compiler (LLVM IR Generation) Changes
4-1. StmtVisitor.visit_ClassDef()
- When a ClassDef node is encountered, a class dictionary is created using PyDict_New(), and methods are registered (PyDict_SetItem).
- The class name is prepared using PyUnicode_FromString.
- Finally, PyClass_New is called to obtain a class object.
- The class is made a local symbol using %ClassName = alloca ptr; store ptr <class_obj>, ptr %ClassName.
After this flow is completed, the user-defined class is treated as a "pointer" within the program.
class_dict, __init__, and other components are managed at the C level, making it easy to extend dynamically.
4-2. ExprVisitor.visit_Call() -> Special Handling for "Class Calls"
- If the target being called is a class object (PyClassObject), then tp_call points to class_call -> PyInstance_New(...).
- This is not yet strictly implemented, but the ultimate plan is to check whether the target's type is PyClass_Type (an isinstance-style check): if it is a class, perform instance creation; otherwise, perform a function call.
This involves how the compiler decides between a "user function call" and a "class call (instantiation)." Currently, it simply checks whether %funcName is a registered symbol, but in the future, a mechanism to more clearly distinguish whether a symbol is a class or a function will be necessary.
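One way the planned "class or function?" check could work at runtime is a comparison on ob_type. The sketch below is only an assumption about a possible design: PyFunction_Type, the CallKind enum, and classify_call are hypothetical names, and the object header is reduced to the type pointer.

```c
#include <stddef.h>

typedef struct { const char *tp_name; } PyTypeObject;
typedef struct { PyTypeObject *ob_type; } PyObject;  /* header reduced */

static PyTypeObject PyClass_Type    = { "class" };
static PyTypeObject PyFunction_Type = { "function" };  /* hypothetical name */

typedef enum {
    CALL_INSTANTIATE,   /* target is a class: create an instance */
    CALL_FUNCTION,      /* target is a function: plain call */
    CALL_NOT_CALLABLE   /* anything else */
} CallKind;

/* Decide how a call expression should be handled from the target's type. */
CallKind classify_call(const PyObject *target) {
    if (target == NULL) return CALL_NOT_CALLABLE;
    if (target->ob_type == &PyClass_Type) return CALL_INSTANTIATE;
    if (target->ob_type == &PyFunction_Type) return CALL_FUNCTION;
    return CALL_NOT_CALLABLE;
}
```

If every callable type fills in tp_call, the compiler can skip this branch entirely and always emit a generic call, at the cost of one indirect jump per call.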
5. Advantages and Future Extensions
5-1. Advantages
- Full-fledged Python classes and methods can be written
  - The foundation is in place to convert forms like class Foo: def method(self): ... into IR and actually run them.
- Existing Python object-oriented syntax is easy to adapt
  - Since it is close to the CPython-compatible API, tp_call, tp_getattro, tp_setattro, etc., can be utilized.
- Future integration with C extension modules
  - By using the PyTypeObject form, it is possible that C extension libraries can be used without significant modification.
5-2. Future Extension Points
- Actual Bytecode / LLVM IR Function Calls
  - How to manage func_code and func_globals within a PyFunctionObject.
  - Currently, only minimal native function pointers are supported, and complex Python bytecode execution is not yet implemented.
- Full Support for __init__ and __new__
  - While there is a mechanism to call __init__ in PyInstance_New or PyClass_Type.tp_call, handling of arguments and method lookup in inheritance chains is not yet sufficient.
- Exceptions and Inheritance Chains
  - Handling class Derived(Base) using cl_bases, and strengthening the implementation around PyErr_*.
  - Another challenge is how to convert try: ... except: ... into IR via exceptions.c / excepthandler.c.
Summary & Future Outlook
- I introduced a CPython-like object-oriented model all at once, enabling the representation of classes, instances, and methods centered around PyClassObject, PyInstanceObject, PyMethodObject, and PyFunctionObject.
- The layout and function names, such as classobject.c, methodobject.c, functionobject.c, and instanceobject.c, closely resemble the CPython C API.
- On the compiler (LLVM IR generation) side, I am adding processes for ClassDef, FunctionDef, and Call to output "class dictionary preparation," "class creation via PyClass_New," "instantiation (PyInstance_New)," and "method binding."
- In the future, I plan to support more advanced Python behaviors, such as __init__ and __new__, the exception system, and bytecode-like function code.
In summary, the "skeleton" of Python's object-oriented features has begun to take shape with this update. By gradually extending the implementation of various methods, exceptions, and inheritance, a world very close to "CPython compatibility" seems to be coming into view.
In the next post, I plan to delve a bit deeper into the handling of the class dictionary (cl_dict) and look at how the templates for method registration and inheritance chains are constructed. Stay tuned!
-> To streamline future implementation, I have introduced MLIR and significantly updated the implementation. I intend to write about that next time!