Building a Python Compiler #8 - Building the Object Model

Hello. Up to the previous post (#7.5 - Project Name Change and Layout Restructuring (Extra Edition)), we have covered the basics such as "Lexical Analysis -> AST Conversion -> LLVM IR Generation" and optimization through primitive type boxing/unboxing.
This time, I will finally talk about the introduction of a full-fledged "Object-Oriented Model." Specifically, I have started creating new files such as classobject.c, instanceobject.c, methodobject.c, and functionobject.c to reproduce classes, instances, and methods using a mechanism close to CPython.
Below, I summarize what this "Object Model Construction" aims for and how it is structured overall, as plainly as I can.

1. What kind of "Object Model" do I want to create?

1-1. Modeling after CPython's Object System

This project's theme is "converting Python code to LLVM IR and producing an executable file," but the ultimate goal is to stay compatible with major Python libraries and C extensions in the future.
Therefore, I have started to mimic the "CPython-style object system" quite faithfully as a base. Specifically:

  • Type structure centered around PyObject and PyVarObject
    • All objects share ob_refcnt and ob_type.
    • Lists, dictionaries, etc., use PyVarObject to have an element count ob_size.
  • PyTypeObject filled with tp_* (destructors and operator tables)
    • It contains members like tp_call, tp_repr, and tp_as_number to define "how this type behaves."
  • Reference count management with Py_INCREF/Py_DECREF + Boehm GC
    • CPython combines reference counting with a cycle-detecting GC. This project instead pairs reference counting with Boehm GC and leaves the reclamation of circular references to it.
    • Even so, I still include Py_INCREF, Py_DECREF, etc., with compatibility in mind.
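
As a rough illustration of the layout described above, here is a minimal sketch of the shared headers and the reference-count macros. It is not the project's actual code: the slot list is cut down to three tp_* members, Demo_Type, demo_new, demo_dealloc, and dealloc_calls are hypothetical names, and plain malloc/free stands in for the Boehm GC allocator so that the sketch compiles without libgc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct _typeobject PyTypeObject;

/* Header shared by every object. */
typedef struct _object {
    ptrdiff_t ob_refcnt;      /* reference count */
    PyTypeObject *ob_type;    /* which type this object belongs to */
} PyObject;

/* Header for variable-size objects such as lists. */
typedef struct {
    PyObject ob_base;
    ptrdiff_t ob_size;        /* number of elements */
} PyVarObject;

/* Cut-down type object: only a few of the tp_* slots. */
struct _typeobject {
    PyVarObject ob_base;
    const char *tp_name;
    void (*tp_dealloc)(PyObject *);
    PyObject *(*tp_repr)(PyObject *);
    PyObject *(*tp_call)(PyObject *, PyObject *, PyObject *);
};

#define Py_INCREF(op) (((PyObject *)(op))->ob_refcnt++)
#define Py_DECREF(op)                                                  \
    do {                                                               \
        PyObject *tmp_ = (PyObject *)(op);                             \
        if (--tmp_->ob_refcnt == 0 && tmp_->ob_type->tp_dealloc)       \
            tmp_->ob_type->tp_dealloc(tmp_);                           \
    } while (0)

int dealloc_calls = 0;  /* hypothetical counter, for demonstration only */

void demo_dealloc(PyObject *op) {
    assert(op->ob_refcnt == 0);
    dealloc_calls++;
    free(op);  /* assumption: with Boehm GC this could be left to the collector */
}

PyTypeObject Demo_Type = { .tp_name = "demo", .tp_dealloc = demo_dealloc };

PyObject *demo_new(void) {
    PyObject *op = malloc(sizeof *op);  /* assumption: malloc instead of the GC allocator */
    if (op == NULL) return NULL;
    op->ob_refcnt = 1;
    op->ob_type = &Demo_Type;
    return op;
}
```

An object created with demo_new() starts with a count of 1; tp_dealloc only runs when a Py_DECREF brings the count down to 0.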

This design leaves open the possibility of partially implementing "APIs similar to CPython" at the C level, or of leveraging existing Python extension modules in the future.

1-2. Main Difference from Before

Previously, the stance was that "int is an LLVM i32" and "str (Python string) is a PyUnicodeObject*": define only minimal types in C and keep the rest of the logic simple.
The major change in this update is that I have started to reproduce the Pythonic object layout and dispatch mechanism, including "class" and "def."

As a result, many files have been added to runtime/builtin/objects/, and concepts such as PyClassObject, PyInstanceObject, PyMethodObject, and PyFunctionObject have appeared. By combining these:

  • Represent the class body (class MyClass:) with "PyClassObject + cl_dict (attribute dictionary)."
  • Generate instances as "PyInstanceObject + in_dict" and handle calls to myinstance.method() via PyMethodObject.
  • Hold function objects (def func()) in the form of PyFunctionObject.

...I am aiming for an architecture that is quite close to CPython.

2. Main Newly Added C Files and Structs

2-1. classobject.c/h - Class Definitions

  • PyClassObject: A struct to represent a class.
    • cl_name: Class name (e.g., "MyClass").
    • cl_dict: Class dictionary (equivalent to __dict__); holds methods and class variables.
    • cl_bases: A list of base classes (to support inheritance).
  • PyClass_Type: The "class type object," i.e. the PyTypeObject that describes classes.
    • Assigns class_call to tp_call to implement "class calling (= instantiation)."
  • PyClass_New(name, bases, dict): A function that generates a new class.
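
The pieces listed above might fit together roughly as follows. This is a minimal sketch under stated assumptions, not the project's source: the PyTypeObject is reduced to tp_name and tp_call, class_call is a placeholder that returns NULL, reference counting of the stored fields is omitted, and malloc stands in for the GC allocator.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct _typeobject PyTypeObject;
typedef struct _object {
    long ob_refcnt;
    PyTypeObject *ob_type;
} PyObject;

/* Cut-down type object (assumption: only the slots needed here). */
struct _typeobject {
    PyObject ob_base;
    const char *tp_name;
    PyObject *(*tp_call)(PyObject *callable, PyObject *args, PyObject *kwargs);
};

typedef struct {
    PyObject ob_base;
    PyObject *cl_name;   /* class name, e.g. "MyClass" */
    PyObject *cl_bases;  /* base classes */
    PyObject *cl_dict;   /* methods and class variables */
} PyClassObject;

/* Placeholder: a real class_call would create and initialise an instance. */
PyObject *class_call(PyObject *cls, PyObject *args, PyObject *kwargs) {
    (void)cls; (void)args; (void)kwargs;
    return NULL;
}

/* Calling a class goes through this type's tp_call slot. */
PyTypeObject PyClass_Type = { .tp_name = "class", .tp_call = class_call };

PyObject *PyClass_New(PyObject *name, PyObject *bases, PyObject *dict) {
    assert(name != NULL && dict != NULL);
    PyClassObject *cls = malloc(sizeof *cls);  /* assumption: malloc, not the GC allocator */
    if (cls == NULL) return NULL;
    cls->ob_base.ob_refcnt = 1;
    cls->ob_base.ob_type = &PyClass_Type;
    cls->cl_name = name;
    cls->cl_bases = bases;
    cls->cl_dict = dict;
    return (PyObject *)cls;
}
```

The point of the sketch is that a class is itself an ordinary object whose ob_type is PyClass_Type, so "calling a class" needs no special case beyond the tp_call slot.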

2-2. functionobject.c/h - Function Definitions

  • PyFunctionObject: A struct for holding a function.
    • func_code: Actual function code (a function pointer for native functions; LLVM IR or bytecode in the future).
    • func_name: Function name.
    • func_defaults: Default arguments (some parts are still unimplemented).
  • PyFunction_New(...): Creates a new function object.
  • PyFunction_FromNative(...): Used when registering a native C function as a Python function object.
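
To make the "native function pointer" case concrete, here is a small sketch of a function object that wraps a C function. The signatures are assumptions, since the article elides them: native_fn, the (name, code) parameters of PyFunction_FromNative, the const char * type of func_name, and the helper native_identity are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct _typeobject PyTypeObject;
typedef struct _object {
    long ob_refcnt;
    PyTypeObject *ob_type;
} PyObject;
struct _typeobject {
    PyObject ob_base;
    const char *tp_name;
    PyObject *(*tp_call)(PyObject *callable, PyObject *args, PyObject *kwargs);
};

/* Hypothetical native signature: takes the argument object, returns a result. */
typedef PyObject *(*native_fn)(PyObject *args);

typedef struct {
    PyObject ob_base;
    native_fn func_code;      /* native function pointer for now */
    const char *func_name;    /* assumption: a C string rather than a Python string */
    PyObject *func_defaults;  /* default arguments, unused in this sketch */
} PyFunctionObject;

/* tp_call for functions: forward the arguments to the native code. */
PyObject *function_call(PyObject *callable, PyObject *args, PyObject *kwargs) {
    PyFunctionObject *f = (PyFunctionObject *)callable;
    (void)kwargs;
    return f->func_code(args);
}

PyTypeObject PyFunction_Type = { .tp_name = "function", .tp_call = function_call };

PyObject *PyFunction_FromNative(const char *name, native_fn code) {
    assert(name != NULL && code != NULL);
    PyFunctionObject *f = malloc(sizeof *f);  /* assumption: malloc, not the GC allocator */
    if (f == NULL) return NULL;
    f->ob_base.ob_refcnt = 1;
    f->ob_base.ob_type = &PyFunction_Type;
    f->func_code = code;
    f->func_name = name;
    f->func_defaults = NULL;
    return (PyObject *)f;
}

/* Example native function: returns its argument unchanged. */
PyObject *native_identity(PyObject *args) {
    return args;
}
```

Registering native_identity with PyFunction_FromNative yields an object that can be invoked through its type's tp_call slot like any other callable.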

2-3. methodobject.c/h - Method Binding

  • PyMethodObject: A bound method struct that bundles an "instance + function."
    • im_func: Function (PyFunctionObject*).
    • im_self: Instance (PyInstanceObject*).
    • Inserts im_self as the first argument upon calling via tp_call.
  • PyMethod_New(func, self): Binds a function and an instance to create a new method.
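
The "insert im_self as the first argument" step can be sketched as below. This is an illustration under assumptions rather than the project's implementation: arguments are passed as a plain C array instead of a tuple, im_self is typed as PyObject * for brevity, the type object is left out, and MAX_ARGS, native_get_self, and last_argc are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct _object {
    long ob_refcnt;
    struct _typeobject *ob_type;  /* type objects are omitted in this sketch */
} PyObject;

#define MAX_ARGS 8  /* hypothetical limit that keeps the sketch allocation-free */

/* Hypothetical calling convention: an array of arguments plus its length. */
typedef PyObject *(*native_fn)(PyObject **argv, size_t argc);

typedef struct {
    PyObject ob_base;
    native_fn func_code;
} PyFunctionObject;

typedef struct {
    PyObject ob_base;
    PyFunctionObject *im_func;  /* the function being bound */
    PyObject *im_self;          /* the instance it is bound to */
} PyMethodObject;

PyObject *PyMethod_New(PyFunctionObject *func, PyObject *self) {
    assert(func != NULL && self != NULL);
    PyMethodObject *m = malloc(sizeof *m);  /* assumption: malloc, not the GC allocator */
    if (m == NULL) return NULL;
    m->ob_base.ob_refcnt = 1;
    m->ob_base.ob_type = NULL;
    m->im_func = func;
    m->im_self = self;
    return (PyObject *)m;
}

/* Prepend im_self to the arguments, then call the underlying function. */
PyObject *method_call(PyObject *method, PyObject **argv, size_t argc) {
    PyMethodObject *m = (PyMethodObject *)method;
    PyObject *full[MAX_ARGS + 1];
    if (argc > MAX_ARGS) return NULL;
    full[0] = m->im_self;
    for (size_t i = 0; i < argc; i++)
        full[i + 1] = argv[i];
    return m->im_func->func_code(full, argc + 1);
}

size_t last_argc = 0;  /* records how many arguments the native function saw */

/* Example native function: returns its first argument (self). */
PyObject *native_get_self(PyObject **argv, size_t argc) {
    last_argc = argc;
    return argv[0];
}
```

Calling a bound method with one explicit argument therefore reaches the native function with two arguments, the first being the instance.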

2-4. instanceobject.c/h - Instances

  • PyInstanceObject:
    • in_class: Which class (PyClassObject*) it was generated from.
    • in_dict: Attribute dictionary specific to this instance.
  • PyInstance_NewRaw / PyInstance_New: Functions that actually create instances.
    • PyInstance_NewRaw does not call __init__.
    • PyInstance_New searches for and calls the __init__ method.
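
The difference between the two constructors can be shown with a deliberately simplified sketch. The assumptions are significant: cl_init is a hypothetical stand-in for looking up cl_dict["__init__"], in_value stands in for in_dict, the functions return PyInstanceObject * rather than PyObject *, and __init__ takes no arguments.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct _object {
    long ob_refcnt;
    struct _typeobject *ob_type;  /* type objects are omitted in this sketch */
} PyObject;

typedef struct _instance PyInstanceObject;
typedef void (*init_fn)(PyInstanceObject *self);

typedef struct {
    PyObject ob_base;
    init_fn cl_init;  /* stand-in for cl_dict["__init__"]; NULL if absent */
} PyClassObject;

struct _instance {
    PyObject ob_base;
    PyClassObject *in_class;  /* the class this instance was created from */
    int in_value;             /* stand-in for in_dict: one attribute */
};

/* Allocate the instance only; __init__ is NOT called. */
PyInstanceObject *PyInstance_NewRaw(PyClassObject *cls) {
    assert(cls != NULL);
    PyInstanceObject *inst = calloc(1, sizeof *inst);  /* assumption: not the GC allocator */
    if (inst == NULL) return NULL;
    inst->ob_base.ob_refcnt = 1;
    inst->in_class = cls;
    return inst;
}

/* Allocate the instance, then run __init__ if the class defines one. */
PyInstanceObject *PyInstance_New(PyClassObject *cls) {
    PyInstanceObject *inst = PyInstance_NewRaw(cls);
    if (inst != NULL && cls->cl_init != NULL)
        cls->cl_init(inst);
    return inst;
}

/* Example __init__: sets the single attribute. */
void demo_init(PyInstanceObject *self) {
    self->in_value = 42;
}
```

With a class whose cl_init is demo_init, PyInstance_NewRaw leaves in_value at 0 while PyInstance_New sets it to 42.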

With these pieces in place, the whole flow of class -> instance -> method call can be handled at the C level.

3. How Class Definition and Instance Creation Work

3-1. Behind the Scenes of Class Definition (class MyClass:)

  1. class_dict = PyDict_New()
    • First, an empty dictionary is created to register the class's methods and variables.
  2. Read Method Definitions (FunctionDef) and Create PyFunctionObject (or Native Function)
    • Register them using PyDict_SetItem(class_dict, method_name, function_object).
  3. Convert Class Name ("MyClass") to a Python String (PyUnicodeObject)
  4. Call PyClass_New(class_name, bases, class_dict) to Get a PyClassObject
    • cl_name = "MyClass"
    • cl_dict = class_dict
    • cl_bases = ... (currently an empty list)
  5. Save the generated class object as a symbol like %MyClass and register it in the symbol table.

In this way, "MyClass" can operate as a Python class object (PyClassObject) at runtime.
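
The sequence of steps above can be sketched in C roughly as follows. To keep the example self-contained, a tiny fixed-capacity MiniDict (hypothetical) stands in for the real dictionary type and its PyDict_* functions, the class name stays a C string instead of a PyUnicodeObject, and build_class is a made-up helper that plays the role of the code the compiler emits.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct _object {
    long ob_refcnt;
    struct _typeobject *ob_type;  /* type objects are omitted in this sketch */
} PyObject;

#define DICT_CAP 8

/* Hypothetical stand-in for a dictionary: linear search over a small array. */
typedef struct {
    PyObject ob_base;
    const char *keys[DICT_CAP];
    PyObject *values[DICT_CAP];
    size_t used;
} MiniDict;

MiniDict *MiniDict_New(void) {
    MiniDict *d = calloc(1, sizeof *d);
    if (d != NULL) d->ob_base.ob_refcnt = 1;
    return d;
}

int MiniDict_SetItem(MiniDict *d, const char *key, PyObject *value) {
    for (size_t i = 0; i < d->used; i++)
        if (strcmp(d->keys[i], key) == 0) { d->values[i] = value; return 0; }
    if (d->used == DICT_CAP) return -1;  /* full */
    d->keys[d->used] = key;
    d->values[d->used] = value;
    d->used++;
    return 0;
}

PyObject *MiniDict_GetItem(const MiniDict *d, const char *key) {
    for (size_t i = 0; i < d->used; i++)
        if (strcmp(d->keys[i], key) == 0) return d->values[i];
    return NULL;
}

typedef struct {
    PyObject ob_base;
    const char *cl_name;
    MiniDict *cl_dict;
    PyObject *cl_bases;
} PyClassObject;

PyClassObject *build_class(const char *name, const char *method_name,
                           PyObject *function_object) {
    assert(name != NULL && method_name != NULL);
    MiniDict *class_dict = MiniDict_New();                       /* step 1 */
    if (class_dict == NULL) return NULL;
    MiniDict_SetItem(class_dict, method_name, function_object);  /* step 2 */
    PyClassObject *cls = calloc(1, sizeof *cls);                 /* steps 3-4 */
    if (cls == NULL) { free(class_dict); return NULL; }
    cls->ob_base.ob_refcnt = 1;
    cls->cl_name = name;
    cls->cl_dict = class_dict;
    cls->cl_bases = NULL;  /* no base classes yet */
    return cls;            /* step 5: the caller stores this pointer under a symbol */
}
```

After build_class("MyClass", "method", fn), looking up "method" in cl_dict returns the registered function object.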

3-2. Instance Creation (x = MyClass())

  1. At the LLVM IR level, load the class object: %class_obj = load ptr, ptr %MyClass.
  2. Emit call ptr @PyObject_Call(ptr %class_obj, <args>, null), or jump directly to class_call.
  3. Inside class_call, PyInstance_New is called to create a PyInstanceObject.
  4. If an __init__ method exists (cl_dict["__init__"]), call it.

At this point, a PyInstanceObject is returned and held as %x.
From the user program's perspective, it looks like an instance is created by "x = MyClass()", but in reality, the class's tp_call implementation runs the flow: instance creation -> __init__ call.
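
The runtime side of this flow might look like the sketch below: a generic PyObject_Call that dispatches through tp_call, with class_call creating the instance. It is a simplified illustration, not the project's code: the __init__ step and argument handling are left out, instances are marked non-callable by a NULL tp_call, and a failed call returns NULL where a real runtime would raise TypeError.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct _typeobject PyTypeObject;
typedef struct _object {
    long ob_refcnt;
    PyTypeObject *ob_type;
} PyObject;
struct _typeobject {
    PyObject ob_base;
    const char *tp_name;
    PyObject *(*tp_call)(PyObject *callable, PyObject *args, PyObject *kwargs);
};

typedef struct { PyObject ob_base; } PyClassObject;  /* fields omitted */
typedef struct {
    PyObject ob_base;
    PyClassObject *in_class;
} PyInstanceObject;

/* Instances are not callable in this sketch: tp_call stays NULL. */
PyTypeObject PyInstance_Type = { .tp_name = "instance" };

/* tp_call of classes: create the instance (the __init__ call is omitted). */
PyObject *class_call(PyObject *cls, PyObject *args, PyObject *kwargs) {
    (void)args; (void)kwargs;
    PyInstanceObject *inst = malloc(sizeof *inst);  /* assumption: not the GC allocator */
    if (inst == NULL) return NULL;
    inst->ob_base.ob_refcnt = 1;
    inst->ob_base.ob_type = &PyInstance_Type;
    inst->in_class = (PyClassObject *)cls;
    return (PyObject *)inst;
}

PyTypeObject PyClass_Type = { .tp_name = "class", .tp_call = class_call };

/* Generic entry point that the emitted IR would call. */
PyObject *PyObject_Call(PyObject *callable, PyObject *args, PyObject *kwargs) {
    assert(callable != NULL);
    if (callable->ob_type->tp_call == NULL)
        return NULL;  /* a real runtime would raise TypeError here */
    return callable->ob_type->tp_call(callable, args, kwargs);
}
```

Because the dispatch goes through the type's slot, the emitted IR can use the same call for classes and for any other callable object.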

3-3. Method Calling (x.method(...))

  • When x.method is looked up via instance_getattro, a PyMethodObject is created.
    • (im_func=cl_dict["method"], im_self=x)
  • At the time of calling (method_call), self is inserted at the beginning and executed.
  • Essentially, the native function or LLVM IR function inside the PyFunctionObject is called.

This is the mechanism of "bound methods." Since it follows the same flow as CPython, the Python behavior where "methods are stored in the class dictionary and automatically bound upon instance access" is reproduced.
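
A minimal sketch of that lookup-and-bind step follows. Several simplifications are assumptions of this example: each "dictionary" is a single Slot (hypothetical), the attribute name is a C string rather than a Python string object, a missing attribute returns NULL instead of raising AttributeError, and the instance dictionary is searched before the class dictionary, following Python's usual attribute order.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct _typeobject { const char *tp_name; } PyTypeObject;  /* cut down */
typedef struct _object {
    long ob_refcnt;
    PyTypeObject *ob_type;
} PyObject;

PyTypeObject PyFunction_Type = { "function" };
PyTypeObject PyMethod_Type = { "method" };

/* Hypothetical stand-in for a dictionary that holds one entry. */
typedef struct { const char *key; PyObject *value; } Slot;

typedef struct { PyObject ob_base; Slot cl_dict; } PyClassObject;
typedef struct {
    PyObject ob_base;
    PyClassObject *in_class;
    Slot in_dict;
} PyInstanceObject;
typedef struct {
    PyObject ob_base;
    PyObject *im_func;
    PyObject *im_self;
} PyMethodObject;

PyObject *slot_get(const Slot *s, const char *name) {
    return (s->key != NULL && strcmp(s->key, name) == 0) ? s->value : NULL;
}

PyObject *instance_getattro(PyObject *obj, const char *name) {
    assert(obj != NULL && name != NULL);
    PyInstanceObject *inst = (PyInstanceObject *)obj;

    PyObject *found = slot_get(&inst->in_dict, name);  /* 1. instance attributes */
    if (found != NULL) return found;

    found = slot_get(&inst->in_class->cl_dict, name);  /* 2. class attributes */
    if (found == NULL) return NULL;                    /* would raise AttributeError */
    if (found->ob_type != &PyFunction_Type)
        return found;                                  /* plain class variable */

    PyMethodObject *m = malloc(sizeof *m);             /* 3. bind function to instance */
    if (m == NULL) return NULL;
    m->ob_base.ob_refcnt = 1;
    m->ob_base.ob_type = &PyMethod_Type;
    m->im_func = found;
    m->im_self = obj;
    return (PyObject *)m;
}
```

Only functions found in the class dictionary are wrapped; values stored on the instance itself, or non-function class variables, are returned unchanged.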

4. Compiler (LLVM IR Generation) Changes

4-1. StmtVisitor.visit_ClassDef()

  • When a ClassDef node is encountered, a class dictionary is created using PyDict_New(), and methods are registered (PyDict_SetItem).
  • The class name is prepared using PyUnicode_FromString.
  • Finally, PyClass_New is called to obtain a class object.
  • The class is made a local symbol using %ClassName = alloca ptr; store ptr <class_obj>, ptr %ClassName.

After this flow is completed, the user-defined class is treated as a "pointer" within the program.
class_dict, __init__, and other components are managed at the C level, making it easy to extend dynamically.

4-2. ExprVisitor.visit_Call() -> Special Handling for "Class Calls"

  • If the target being called is a class object (PyClassObject), its tp_call points to class_call, which in turn calls PyInstance_New(...).
  • This is not yet implemented strictly. The eventual plan is to check whether the target's type is PyClass_Type (the equivalent of an isinstance check): if it is a class, create an instance; otherwise, perform a function call.

The underlying question is how the compiler decides between a "user function call" and a "class call (instantiation)." Currently, it simply checks whether %funcName is a registered symbol, but in the future a mechanism that clearly distinguishes whether a symbol is a class or a function will be necessary.
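
One possible shape for such a check at runtime is a plain type comparison, sketched below. This is only an illustration of the idea: PyClass_Check is written here as a hypothetical macro, and CallKind and classify_call are made-up names; the project may well resolve the question at compile time through the symbol table instead.

```c
#include <assert.h>
#include <stddef.h>

typedef struct _typeobject { const char *tp_name; } PyTypeObject;  /* cut down */
typedef struct _object {
    long ob_refcnt;
    PyTypeObject *ob_type;
} PyObject;

PyTypeObject PyClass_Type = { "class" };
PyTypeObject PyFunction_Type = { "function" };

/* True when the object's type is exactly PyClass_Type. */
#define PyClass_Check(op) (((PyObject *)(op))->ob_type == &PyClass_Type)

typedef enum { CALL_INSTANTIATE, CALL_FUNCTION } CallKind;

/* Decide which call path a target should take. */
CallKind classify_call(PyObject *target) {
    assert(target != NULL);
    return PyClass_Check(target) ? CALL_INSTANTIATE : CALL_FUNCTION;
}
```

A class object is routed to the instantiation path, and everything else falls through to the ordinary function-call path.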

5. Advantages and Future Extensions

5-1. Advantages

  1. Full-fledged Python classes and methods can be written
    • The foundation is in place to convert forms like class Foo: def method(self): ... into IR and actually run them.
  2. Existing Python object-oriented syntax is easy to adapt
    • Because the runtime stays close to the CPython API, slots such as tp_call, tp_getattro, and tp_setattro can be used.
  3. Future integration with C extension modules
    • Because types are expressed as PyTypeObject, C extension libraries may become usable without significant modification.

5-2. Future Extension Points

  1. Actual Bytecode / LLVM IR Function Calls
    • How to manage func_code and func_globals within a PyFunctionObject.
    • Currently, only minimal native function pointers are supported, and complex Python bytecode execution is not yet implemented.
  2. Full Support for __init__ and __new__
    • While there is a mechanism to call __init__ in PyInstance_New or PyClass_Type.tp_call, handling of arguments and method lookup in inheritance chains is not yet sufficient.
  3. Exceptions and Inheritance Chains
    • Handling class Derived(Base) using cl_bases, and strengthening the implementation around PyErr_*.
    • Another challenge is how to convert try: ... except: ... into IR via exceptions.c / excepthandler.c.
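
As one possible direction for method lookup through cl_bases, here is a sketch of a simple depth-first, left-to-right search. It is not something the article says is implemented: class_lookup, MAX_BASES, and the single-attribute stand-in for cl_dict are hypothetical, the PyObject header is omitted, and modern CPython uses C3 linearization (the MRO) rather than this simpler order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_BASES 4  /* hypothetical limit for this sketch */

typedef struct _class {
    const char *attr_name;  /* stand-in for cl_dict: one attribute name */
    int attr_value;         /* and its value */
    struct _class *cl_bases[MAX_BASES];
    size_t n_bases;
} PyClassObject;

/* Search the class itself first, then each base in order, recursively. */
const int *class_lookup(const PyClassObject *cls, const char *name) {
    assert(cls != NULL && name != NULL);
    if (cls->attr_name != NULL && strcmp(cls->attr_name, name) == 0)
        return &cls->attr_value;
    for (size_t i = 0; i < cls->n_bases; i++) {
        const int *found = class_lookup(cls->cl_bases[i], name);
        if (found != NULL) return found;
    }
    return NULL;  /* not found anywhere in the chain */
}
```

With class Derived(Base), an attribute defined only on Base is still found when the search starts from Derived, while a name defined nowhere yields NULL.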

Summary & Future Outlook

  • I introduced a CPython-like object-oriented model all at once, enabling the representation of classes, instances, and methods centered around PyClassObject, PyInstanceObject, PyMethodObject, and PyFunctionObject.
  • The layout and function names, such as classobject.c, methodobject.c, functionobject.c, and instanceobject.c, closely resemble the CPython C API.
  • On the compiler (LLVM IR generation) side, I am adding processes for ClassDef, FunctionDef, and Call to output "class dictionary preparation," "class creation via PyClass_New," "instantiation (PyInstance_New)," and "method binding."
  • In the future, I plan to support more advanced Python behaviors, such as __init__ and __new__, the exception system, and bytecode-like function code.

In summary, the "skeleton" of Python's object-oriented features has begun to take shape with this update. By gradually extending the implementation of various methods, exceptions, and inheritance, a world very close to "CPython compatibility" seems to be coming into view.
In the next post, I plan to delve a bit deeper into the handling of the class dictionary (cl_dict) and look at how the templates for method registration and inheritance chains are constructed. Stay tuned!
Update: to streamline future work, I have since introduced MLIR and substantially reworked the implementation. I intend to write about that next time!

Next:
https://zenn.dev/t3tra/articles/05fcc322102215
