Diving into the TypeScript Source Code

Have you ever felt like reading the source code of TypeScript, which we all rely on? I did.

https://github.com/microsoft/TypeScript

It took about a week. I can't say I've read every single line, but I've grasped the overview.

Since it was quite complex and gaining domain knowledge was difficult, I'd like to organize and introduce the key players and concepts for those planning to read it.

The code I read was cloned as of June 8, 2023.

First: Setting My Goal

I wanted to incrementally rewrite references across multiple files using findReferences(), findRenameLocations(), and getDefinitionAtPosition() (the API behind "go to definition"), all provided by the TypeScript Language Service.

A minifier like Terser works on plain JavaScript, so information such as what type the current object is, or which members have already been renamed, isn't preserved. Doing the renaming at the TypeScript layer should allow for more aggressive minification.

However, I had only used these APIs briefly. I had some vague knowledge, but I lacked confidence in their internal lifecycle and implementation.

Basic Knowledge of the TypeScript Compiler API

Let's start from the entry point. Besides being used via tsc or VS Code, TypeScript provides a Compiler API as a module.

For example, the following code serves as a basic template.

import ts from "typescript";

// Load the config and target files based on the tsconfig.json in this directory
const tsconfig = ts.readConfigFile("./tsconfig.json", ts.sys.readFile);
const options = ts.parseJsonConfigFileContent(tsconfig.config, ts.sys, "./");

// Get the Compiler Host
const host = ts.createCompilerHost(options.options);
const program = ts.createProgram(options.fileNames, options.options, host);

// Get the type analysis results
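for (const diagnostic of program.getSemanticDiagnostics()) {
  console.log(
    "[semantic diagnostics]",
    diagnostic.file?.fileName.replace(process.cwd(), ""),
    `#${diagnostic.start}:${diagnostic.length}`,
    "-",
    // messageText can be a nested message chain, so flatten it to a string
    ts.flattenDiagnosticMessageText(diagnostic.messageText, "\n"),
    // Diagnostics without a file (e.g. option errors) have no position
    diagnostic.file?.getLineAndCharacterOfPosition(diagnostic.start ?? 0),
  );
}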

I will explain the key players based on this.

Essentially, you'll be reading src/compiler/*.ts. src/compiler/types.ts is particularly helpful, as almost all public interface types are defined there.

ts.System

ts.System defines the API types that represent the environment, such as readFile and writeFile. The ts.sys object loaded in a Node.js environment is the ts.System implementation for Node.js. When TypeScript is loaded outside of Node.js (for example in a browser or Deno), ts.sys is not populated, but you can still run the compiler in any environment by implementing the interface yourself. You don't even need a real file system backend; for example, the TS Playground runs on an in-memory mock.

https://www.typescriptlang.org/play

ts.CompilerHost

The most basic host abstraction for the compiler environment. The default implementation returned by ts.createCompilerHost() is built on top of ts.sys.

By providing this CompilerHost (or one of its more specialized variants) as an argument, you can run components like Program, BuilderProgram, WatchProgram, and LanguageService, which will be discussed later.

There are several create~Host functions, which generally pair with a corresponding create~Program. When reading with a specific goal, you will mostly aim to generate the ~Host for the ~Program you intend to use.
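To make the host idea more concrete, here is a minimal sketch of a CompilerHost that serves one file from memory and falls back to the default ts.sys-backed host for everything else (such as the lib .d.ts files). The /virtual/index.ts path and the names files and memoryHost are made up for this example, and it assumes the typescript package is installed. It is only one possible arrangement, not the canonical way to build a host.

```typescript
import * as ts from "typescript";

// Hypothetical in-memory "file system": file name -> source text
const files = new Map<string, string>([
  ["/virtual/index.ts", 'export const x: number = "not a number";'],
]);

const compilerOptions: ts.CompilerOptions = {
  target: ts.ScriptTarget.ESNext,
  strict: true,
  noEmit: true,
  types: [], // do not auto-include @types packages from the working directory
};

// Start from the default host and override only the file access
const baseHost = ts.createCompilerHost(compilerOptions);
const memoryHost: ts.CompilerHost = {
  ...baseHost,
  fileExists: (fileName) => files.has(fileName) || baseHost.fileExists(fileName),
  readFile: (fileName) => files.get(fileName) ?? baseHost.readFile(fileName),
  getSourceFile: (fileName, languageVersionOrOptions, onError, shouldCreate) => {
    const text = files.get(fileName);
    return text !== undefined
      ? ts.createSourceFile(fileName, text, languageVersionOrOptions)
      : baseHost.getSourceFile(fileName, languageVersionOrOptions, onError, shouldCreate);
  },
};

const memoryProgram = ts.createProgram(["/virtual/index.ts"], compilerOptions, memoryHost);
const memoryDiagnostics = memoryProgram.getSemanticDiagnostics(
  memoryProgram.getSourceFile("/virtual/index.ts"),
);
for (const d of memoryDiagnostics) {
  console.log(`TS${d.code}:`, ts.flattenDiagnosticMessageText(d.messageText, "\n"));
}
```

Because only the three file-access members are replaced, everything else (current directory, new line handling, default lib lookup) keeps the default behavior.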

ts.SourceFile

An abstraction linked 1:1 with a file.

const newSource = ts.createSourceFile(
    'index.ts',
    'export const x: number = 1;',
    ts.ScriptTarget.ESNext,
);

A file generated with ts.createSourceFile cannot be analyzed across multiple files on its own. If you want to work with multiple files, you generally use a ts.Program, created with ts.createProgram().

(As you'll see if you read the source code, a vast amount of analyzed information is written into this object.)

However, basic AST manipulation is possible with this alone.

For example, code transformation using a transformer like the following doesn't require types, so this is sufficient for just playing around with the AST.

import ts from "typescript";

const source = ts.createSourceFile(
  "test.ts",
  `export const x = 1`,
  ts.ScriptTarget.ESNext,
);

const transformerFactory = (context: ts.TransformationContext) => {
  const visit: ts.Visitor = (node) => {
    if (ts.isVariableDeclaration(node)) {
      return ts.factory.createVariableDeclaration(
        node.name,
        undefined,
        undefined,
        ts.factory.createNumericLiteral("2"),
      );
    }
    return ts.visitEachChild(node, visit, context);
  };
  return (node: ts.SourceFile) => ts.visitNode(node, visit);
};

const result = ts.transform(source, [
  transformerFactory as ts.TransformerFactory<ts.SourceFile>,
]);

const transformedSource = result.transformed[0];
const printer = ts.createPrinter({
  newLine: ts.NewLineKind.LineFeed,
});

console.log(printer.printFile(transformedSource));

ts.Program

A program abstraction composed of multiple SourceFiles. Through this, dependency analysis across multiple files and type analysis become possible.

Simple usage:

const program = ts.createProgram(...);
const diagnostics = program.getSemanticDiagnostics();
const sourceFiles = program.getSourceFiles();
// Output
program.emit();

Internally, it uses the host's IO abstraction to read files like src/index.ts and analyzes which files are called by which, and whether types are broken.

emit uses the writeFile implementation if it's provided in the CompilerHost.

This is equivalent to running tsc -p . (with the current directory as the project).

ts.TypeChecker

This is what actually performs type checking inside ts.Program. On the surface, the available APIs are limited, as it is intended to be used via ts.Program.

If you use it directly, you would likely reach for typeChecker.getSymbolsInScope(). It returns the symbols visible from a given node, filtered by symbol kind.

const typeChecker = program.getTypeChecker();
const symbols = typeChecker.getSymbolsInScope(
    sourceFile,
    ts.SymbolFlags.Type,
);

That said, it is difficult to use without understanding the internal implementation of TypeScript itself.
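For getting a feel for the TypeChecker, a gentler entry point than getSymbolsInScope() may be asking it for the type at a node. The following is a minimal sketch, assuming the typescript package is installed; the virtual file name and the demo* names are made up for this example. It collects the inferred type of each variable declaration using getTypeAtLocation() and typeToString().

```typescript
import * as ts from "typescript";

// Hypothetical in-memory file, served through a thin wrapper around the default host
const demoFile = "/virtual/demo.ts";
const demoText = "export const n = 1 + 2;\nexport const s = String(n);";
const demoOptions: ts.CompilerOptions = { target: ts.ScriptTarget.ESNext, types: [] };
const demoBase = ts.createCompilerHost(demoOptions);
const demoHost: ts.CompilerHost = {
  ...demoBase,
  fileExists: (f) => f === demoFile || demoBase.fileExists(f),
  readFile: (f) => (f === demoFile ? demoText : demoBase.readFile(f)),
  getSourceFile: (f, lang, onError, create) =>
    f === demoFile
      ? ts.createSourceFile(f, demoText, lang)
      : demoBase.getSourceFile(f, lang, onError, create),
};

const demoProgram = ts.createProgram([demoFile], demoOptions, demoHost);
const demoChecker = demoProgram.getTypeChecker();

// Walk the AST and record "variable name -> printed type"
const inferred: Record<string, string> = {};
const visitDeclarations = (node: ts.Node): void => {
  if (ts.isVariableDeclaration(node) && ts.isIdentifier(node.name)) {
    const type = demoChecker.getTypeAtLocation(node.name);
    inferred[node.name.text] = demoChecker.typeToString(type);
  }
  ts.forEachChild(node, visitDeclarations);
};
visitDeclarations(demoProgram.getSourceFile(demoFile)!);

console.log(inferred);
```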

What was interesting while reading the code was node.flags. A node is the basic form of an AST node (ts.Node), and the states it can take after analysis (e.g., string | number) are recorded as bit flags.

This means that, at least for primitive types, computing a union type A | B roughly comes down to nodeA.flags | nodeB.flags, and an intersection type A & B to nodeA.flags & nodeB.flags. In practice, it's much more complex once Object types are involved, but you could say the basis is just doing that between members.
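The bit-flag idea can be observed from the public API as well. The sketch below looks at the flags exposed on ts.Type objects (ts.TypeFlags), which is an assumption on my part about the closest public counterpart to the internal flags described above. The virtual file and the flag* names are made up for the example. It ORs together the flags of the members of string | number and compares the result with the corresponding mask.

```typescript
import * as ts from "typescript";

// Hypothetical in-memory file declaring a union-typed variable
const flagFile = "/virtual/flags.ts";
const flagText = "export declare const v: string | number;";
const flagOptions: ts.CompilerOptions = { target: ts.ScriptTarget.ESNext, types: [] };
const flagBase = ts.createCompilerHost(flagOptions);
const flagHost: ts.CompilerHost = {
  ...flagBase,
  fileExists: (f) => f === flagFile || flagBase.fileExists(f),
  readFile: (f) => (f === flagFile ? flagText : flagBase.readFile(f)),
  getSourceFile: (f, lang, onError, create) =>
    f === flagFile
      ? ts.createSourceFile(f, flagText, lang)
      : flagBase.getSourceFile(f, lang, onError, create),
};

const flagProgram = ts.createProgram([flagFile], flagOptions, flagHost);
const flagChecker = flagProgram.getTypeChecker();

const statement = flagProgram.getSourceFile(flagFile)!.statements[0];
if (!ts.isVariableStatement(statement)) throw new Error("unexpected AST shape");
const declaration = statement.declarationList.declarations[0];
const unionType = flagChecker.getTypeAtLocation(declaration.name);

// The union itself is marked with the Union bit
const isUnion = (unionType.flags & ts.TypeFlags.Union) !== 0;

// OR the flags of each member together
let memberFlags = 0;
if (unionType.isUnion()) {
  for (const member of unionType.types) memberFlags |= member.flags;
}
const expectedFlags = ts.TypeFlags.String | ts.TypeFlags.Number;

console.log(isUnion, memberFlags === expectedFlags);
```

This only covers the primitive case; object types carry additional structure beyond what a single flags field can express.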

Also, node.links contains connection information between nodes. Tracing it is what makes control flow analysis possible, for example narrowing in if (x != null) { /* scope where x is not null */ }.

My understanding of the distinction between ts.Node and ts.Symbol is a bit vague, but as far as I can tell they are closely related; the difference lies primarily in whether an operation is performed on the AST node or on the declared symbol.

ts.BuilderProgram

A Program is actually a snapshot of one analysis pass over the current SourceFiles; it cannot re-analyze anything after the contents of a SourceFile are modified.

The internal abstraction for modifying SourceFiles across multiple iterations is BuilderProgram.

const builder = ts.createSemanticDiagnosticsBuilderProgram(
  options.fileNames,
  options.options,
  host,
);
const program1 = builder.getProgram();
// ...
const program2 = builder.getProgram();

// They might not match
console.log(program1 === program2)

It looks easy to create, but in reality, BuilderProgram itself doesn't contain the logic to notify the internal system about changes. WatchProgram and LanguageService are responsible for wrapping the CompilerHost passed to the BuilderProgram in a way that facilitates incremental operations. (It took me a huge amount of time to understand this.)

Rebuilding a program might seem costly, but createProgram can take another program as an argument and inherit its internal cache of SourceFiles and analysis results.
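Here is a minimal sketch of that reuse, assuming the typescript package is installed. My understanding is that createProgram decides what changed by comparing the SourceFile objects the host hands back, so this example uses a small caching host that returns the same object for unchanged text. The virtual paths and the names sources, parsed, and cachingHost are made up for illustration.

```typescript
import * as ts from "typescript";

const reuseOptions: ts.CompilerOptions = {
  target: ts.ScriptTarget.ESNext,
  noEmit: true,
  types: [],
};

// Hypothetical in-memory sources
const sources = new Map<string, string>([
  ["/virtual/a.ts", "export const a = 1;"],
  ["/virtual/b.ts", "export const b = 2;"],
]);

const reuseBase = ts.createCompilerHost(reuseOptions);
const parsed = new Map<string, { text: string; file: ts.SourceFile }>();
const cachingHost: ts.CompilerHost = {
  ...reuseBase,
  fileExists: (f) => sources.has(f) || reuseBase.fileExists(f),
  readFile: (f) => sources.get(f) ?? reuseBase.readFile(f),
  getSourceFile: (f, languageVersionOrOptions) => {
    const text = sources.get(f) ?? reuseBase.readFile(f);
    if (text === undefined) return undefined;
    const hit = parsed.get(f);
    // Unchanged text: hand back the very same SourceFile object
    if (hit && hit.text === text) return hit.file;
    const file = ts.createSourceFile(f, text, languageVersionOrOptions);
    parsed.set(f, { text, file });
    return file;
  },
};

const roots = Array.from(sources.keys());
const firstProgram = ts.createProgram(roots, reuseOptions, cachingHost);

// Edit only b.ts, then build a new program that inherits from the old one
sources.set("/virtual/b.ts", "export const b: string = 2;");
const secondProgram = ts.createProgram(roots, reuseOptions, cachingHost, firstProgram);

const aReused =
  secondProgram.getSourceFile("/virtual/a.ts") === firstProgram.getSourceFile("/virtual/a.ts");
const bReparsed =
  secondProgram.getSourceFile("/virtual/b.ts") !== firstProgram.getSourceFile("/virtual/b.ts");
const bErrors = secondProgram.getSemanticDiagnostics(
  secondProgram.getSourceFile("/virtual/b.ts"),
).length;

console.log(aReused, bReparsed, bErrors);
```

The fourth argument of createProgram is the old program. The caching itself is the host's job in this sketch, which matches the point above that something has to wrap the host to make incremental operation work.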

ts.WatchProgram

Equivalent to tsc --watch.

It primarily monitors the FileSystem via Node.js and notifies the internal BuilderProgram of changes through callbacks.
File change notifications are buffered in 250ms units, and the internal cache is discarded and reloaded.

The internal cache is represented by ts.ModuleResolution, but there is no way to touch it directly from the surface.

ts.LanguageService

This is the language service that powers editor features. It doesn't strictly speak the LSP interface itself; communication with the editor is handled by ts.Server. In fact, it seems the LSP specification was established after ts.LanguageService, so there are subtle differences.

For example, the logic for VS Code's findReferences and goToDefinition is implemented on top of this using the TypeChecker's analysis results. It also handles completions, such as determining how to complete the rest of a token.

To use this, a special CompilerHost called LanguageServiceHost is required. The following documentation is helpful; in fact, it's the only public documentation available.

https://github.com/microsoft/TypeScript/wiki/Using-the-Compiler-API#incremental-build-support-using-the-language-services

  // Create the language service host to allow the LS to communicate with the host
  const servicesHost: ts.LanguageServiceHost = {
    getScriptFileNames: () => rootFileNames,
    getScriptVersion: fileName =>
      files[fileName] && files[fileName].version.toString(),
    getScriptSnapshot: fileName => {
      if (!fs.existsSync(fileName)) {
        return undefined;
      }

      return ts.ScriptSnapshot.fromString(fs.readFileSync(fileName).toString());
    },
    getCurrentDirectory: () => process.cwd(),
    getCompilationSettings: () => options,
    getDefaultLibFileName: options => ts.getDefaultLibFilePath(options),
    fileExists: ts.sys.fileExists,
    readFile: ts.sys.readFile,
    readDirectory: ts.sys.readDirectory,
    directoryExists: ts.sys.directoryExists,
    getDirectories: ts.sys.getDirectories,
  };

getScriptVersion and getScriptSnapshot are the keys; when a document is rewritten internally, the version corresponding to that file is changed. This causes getScriptSnapshot to be called again, which then returns the new string.
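To see the version mechanism in isolation, here is a minimal sketch with a single in-memory file, assuming the typescript package is installed. The names fileName, text, version, and lsHost are made up for this example. Editing the text and bumping the version is enough for the next query to see the new content.

```typescript
import * as ts from "typescript";

// Hypothetical single in-memory document
const fileName = "/virtual/index.ts";
let text = 'const x: number = "";';
let version = 0;

const settings: ts.CompilerOptions = {
  target: ts.ScriptTarget.ESNext,
  strict: true,
  types: [],
};

const lsHost: ts.LanguageServiceHost = {
  getCompilationSettings: () => settings,
  getScriptFileNames: () => [fileName],
  // Only our document ever changes; lib files keep a constant version
  getScriptVersion: (f) => (f === fileName ? String(version) : "0"),
  getScriptSnapshot: (f) => {
    const content = f === fileName ? text : ts.sys.readFile(f);
    return content === undefined ? undefined : ts.ScriptSnapshot.fromString(content);
  },
  getCurrentDirectory: () => "/virtual",
  getDefaultLibFileName: (o) => ts.getDefaultLibFilePath(o),
  fileExists: (f) => f === fileName || ts.sys.fileExists(f),
  readFile: (f) => (f === fileName ? text : ts.sys.readFile(f)),
};

const service = ts.createLanguageService(lsHost, ts.createDocumentRegistry());
const errorsBefore = service.getSemanticDiagnostics(fileName).length;

// "Edit" the document: new text plus a new version
text = "const x: number = 1;";
version++;
const errorsAfter = service.getSemanticDiagnostics(fileName).length;

console.log(errorsBefore, errorsAfter);
```

If the version were left unchanged after the edit, the service would have no reason to ask for a new snapshot, and the old diagnostics would be expected to come back.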

I wanted to know how to handle this with a proper understanding, so I worked hard to decipher it. As a result, this is what I understood:

  • LanguageService regenerates internal programs in response to source code changes.
  • SourceFile is managed in a documentRegistry, and changes can be notified by updating this along with scriptSnapshot and scriptVersion.
  • Program and TypeChecker are recreated every time the source code changes.
  • By defining getChangeRange in IScriptSnapshot, you can return the change range, and the IncrementalParser uses this to partially rebuild the AST.
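The last point can be sketched as follows, assuming the typescript package is installed. createEditedSnapshot is a hypothetical helper written for this example, not something TypeScript provides. It builds a snapshot that knows how it differs from its predecessor, and the resulting change range is then passed to ts.updateSourceFile() to reparse incrementally.

```typescript
import * as ts from "typescript";

// Hypothetical helper: a snapshot that remembers the edit that produced it
function createEditedSnapshot(
  previous: ts.IScriptSnapshot,
  start: number,
  deleteLength: number,
  insertedText: string,
): ts.IScriptSnapshot {
  const oldText = previous.getText(0, previous.getLength());
  const newText =
    oldText.slice(0, start) + insertedText + oldText.slice(start + deleteLength);
  const changeRange = ts.createTextChangeRange(
    ts.createTextSpan(start, deleteLength), // span replaced in the old text
    insertedText.length, // length of the replacement in the new text
  );
  return {
    getText: (from, to) => newText.slice(from, to),
    getLength: () => newText.length,
    // Only answer for the snapshot we were derived from;
    // undefined means "no information, reparse everything"
    getChangeRange: (old) => (old === previous ? changeRange : undefined),
  };
}

const before = ts.ScriptSnapshot.fromString("const x = 1;\nconst y = x;\n");
// Replace the first "x" (offset 6, length 1) with "renamed"
const after = createEditedSnapshot(before, 6, 1, "renamed");
const range = after.getChangeRange(before)!;
console.log(range.span.start, range.span.length, range.newLength);

// Feed the same range to the incremental parser
const oldFile = ts.createSourceFile(
  "index.ts",
  before.getText(0, before.getLength()),
  ts.ScriptTarget.ESNext,
);
const newFile = ts.updateSourceFile(oldFile, after.getText(0, after.getLength()), range);
console.log(newFile.text);
```

Snapshots created with ts.ScriptSnapshot.fromString() do not report a change range, which is why the host shown in the documentation always triggers a full reparse of an edited file.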

I will explain how to use this effectively in the "Practical Edition" section later.

ts.Server

It mainly provides the interface for communicating with VS Code. I haven't read the source code for this. It's not included in the Compiler API, but if you want to use the APIs implemented there, you can technically call them directly from typescript/lib/tsserver.js.

import ts from "typescript/lib/tsserver.js";
console.log(ts.server); // It has more features than the standard Compiler API

Practical Edition: ts.LanguageService

Using the knowledge gained so far, I implemented a LanguageService that builds an in-memory cache and performs incremental rewriting.

import ts from "typescript";
import fs from "node:fs";
import path from "node:path";
import { DocumentRegistry } from "typescript";

const tsconfig = ts.readConfigFile("./tsconfig.json", ts.sys.readFile);
const options = ts.parseJsonConfigFileContent(tsconfig.config, ts.sys, "./");
const defaultHost = ts.createCompilerHost(options.options);

const expandPath = (fname: string) => {
  if (fname.startsWith("/")) {
    return fname;
  }
  const root = process.cwd();
  return path.join(root, fname);
};

function applyRenameLocations(
  code: string,
  toName: string,
  renameLocations: readonly ts.RenameLocation[],
) {
  let current = code;
  let offset = 0;
  for (const loc of renameLocations) {
    const start = loc.textSpan.start;
    const end = loc.textSpan.start + loc.textSpan.length;
    current = current.slice(0, start + offset) + toName +
      current.slice(end + offset);
    offset += toName.length - (end - start);
  }
  return current;
}

type SnapshotManager = {
  readFileSnapshot(fileName: string): string | undefined;
  writeFileSnapshot(fileName: string, content: string): ts.SourceFile;
};

export interface InMemoryLanguageServiceHost extends ts.LanguageServiceHost {
  getSnapshotManager: (
    registory: DocumentRegistry,
  ) => SnapshotManager;
}

export function createInMemoryLanguageServiceHost(): InMemoryLanguageServiceHost {
  // read once, write on memory
  const fileContents = new Map<string, string>();
  const fileSnapshots = new Map<string, ts.IScriptSnapshot>();
  const fileVersions = new Map<string, number>();
  const fileDirtySet = new Set<string>();

  const getSnapshotManagerInternal: (
    registory: DocumentRegistry,
  ) => SnapshotManager = (registory: ts.DocumentRegistry) => {
    return {
      readFileSnapshot(fileName: string) {
        fileName = expandPath(fileName);
        console.log("[readFileSnapshot]", fileName);
        if (fileContents.has(fileName)) {
          return fileContents.get(fileName) as string;
        }
        return defaultHost.readFile(fileName);
      },
      writeFileSnapshot(fileName: string, content: string) {
        fileName = expandPath(fileName);
        const nextVersion = (fileVersions.get(fileName) || 0) + 1;
        // fileVersions.set(fileName, nextVersion);
        fileContents.set(fileName, content);
        console.log(
          "[writeFileSnapshot]",
          fileName,
          nextVersion,
          content.length,
        );
        fileDirtySet.add(fileName);
        const newSource = registory.updateDocument(
          fileName,
          serviceHost,
          ts.ScriptSnapshot.fromString(content),
          String(nextVersion),
        );
        return newSource;
      },
    };
  };

  const serviceHost: InMemoryLanguageServiceHost = {
    getDefaultLibFileName: defaultHost.getDefaultLibFileName,
    fileExists: ts.sys.fileExists,
    readDirectory: ts.sys.readDirectory,
    directoryExists: ts.sys.directoryExists,
    getDirectories: ts.sys.getDirectories,
    getCurrentDirectory: defaultHost.getCurrentDirectory,
    getScriptFileNames: () => options.fileNames,
    getCompilationSettings: () => options.options,
    readFile: (fname, encode) => {
      fname = expandPath(fname);
      // console.log("[readFile]", fname);
      if (fileContents.has(fname)) {
        return fileContents.get(fname) as string;
      }
      const rawFileResult = ts.sys.readFile(fname, encode);
      if (rawFileResult) {
        fileContents.set(fname, rawFileResult);
        fileVersions.set(
          fname,
          (fileVersions.get(fname) || 0) + 1,
        );
      }
      return rawFileResult;
    },
    writeFile: (fileName, content) => {
      fileName = expandPath(fileName);
      console.log("[writeFile:mock]", fileName, content.length);
    },
    getScriptSnapshot: (fileName) => {
      fileName = expandPath(fileName);
      if (fileName.includes("src/index.ts")) {
        console.log("[getScriptSnapshot]", fileName);
      }
      if (fileSnapshots.has(fileName)) {
        return fileSnapshots.get(fileName)!;
      }
      const contentCache = fileContents.get(fileName);
      if (contentCache) {
        const newSnapshot = ts.ScriptSnapshot.fromString(contentCache);
        fileSnapshots.set(fileName, newSnapshot);
        return newSnapshot;
      }
      if (!fs.existsSync(fileName)) return;
      const raw = ts.sys.readFile(fileName, "utf8")!;
      const snapshot = ts.ScriptSnapshot.fromString(raw);
      fileSnapshots.set(fileName, snapshot);
      return snapshot;
    },
    getScriptVersion: (fileName) => {
      fileName = expandPath(fileName);
      const isDirty = fileDirtySet.has(fileName);
      if (isDirty) {
        const current = fileVersions.get(fileName) || 0;
        fileDirtySet.delete(fileName);
        fileVersions.set(fileName, current + 1);
      }
      return (fileVersions.get(fileName) || 0).toString();
    },
    getSnapshotManager: getSnapshotManagerInternal,
  };
  return serviceHost;
}

This implementation reads from the file system only for the initial load, and thereafter rewrites the in-memory cache through a dedicated SnapshotManager. I went through some trial and error, so it could probably be simplified further.

Following the implementation patterns of TypeScript itself, I created a new interface and a create~ function that consumes the result of create~Host.

I prepared a src/index.ts file like this (intentionally violating types):

const x: number = "";

Here is an example of code using it:

// ...Continued
{
  // usage
  const prefs: ts.UserPreferences = {};
  const registory = ts.createDocumentRegistry();
  const serviceHost = createInMemoryLanguageServiceHost();
  const languageService = ts.createLanguageService(
    serviceHost,
    registory,
  );

  // languageService.
  const snapshotManager = serviceHost.getSnapshotManager(registory);

  // write src/index.ts and check types
  const raw = snapshotManager.readFileSnapshot("src/index.ts");
  const newSource = snapshotManager.writeFileSnapshot(
    "src/index.ts",
    raw + "\nconst y: number = x;",
  );

  // find scoped variables

  // languageService.getSemanticDiagnostics("src/index.ts");
  const program = languageService.getProgram()!;
  const checker = program.getTypeChecker();
  const localVariables = checker.getSymbolsInScope(
    newSource,
    ts.SymbolFlags.BlockScopedVariable,
  );

  // rename x to x_?
  const symbol = localVariables.find((s) => s.name === "x")!;
  const renameLocations = languageService.findRenameLocations(
    "src/index.ts",
    symbol.valueDeclaration!.getStart(),
    false,
    false,
    prefs,
  );
  const targets = new Set(renameLocations!.map((loc) => loc.fileName));

  let current = snapshotManager.readFileSnapshot("src/index.ts")!;
  for (const target of targets) {
    const renameLocationsToTarget = renameLocations!.filter(
      (loc) => expandPath(target) === expandPath(loc.fileName),
    );
    const newSymbolName = `${symbol.name}_${
      Math.random().toString(36).slice(2)
    }`;
    current = applyRenameLocations(
      current,
      newSymbolName,
      renameLocationsToTarget,
    );
  }
  snapshotManager.writeFileSnapshot("src/index.ts", current);
  const result = languageService.getSemanticDiagnostics("src/index.ts");
  console.log("post error", result.length);
  console.log(snapshotManager.readFileSnapshot("src/index.ts"));

  const oldProgram = program;
  {
    // rename y to y_?
    const program = languageService.getProgram()!;
    const program2 = languageService.getProgram()!;
    console.log(
      "------- program updated",
      program !== oldProgram,
      program2 === program,
    );
    const checker = program.getTypeChecker();
    const newSource = program.getSourceFile("src/index.ts")!;
    const localVariables = checker.getSymbolsInScope(
      newSource,
      ts.SymbolFlags.BlockScopedVariable,
    );
    const symbol = localVariables.find((s) => s.name === "y")!;
    const renameLocations = languageService.findRenameLocations(
      "src/index.ts",
      symbol.valueDeclaration!.getStart(),
      false,
      false,
      prefs,
    );
    const targets = new Set(renameLocations!.map((loc) => loc.fileName));
    let current = snapshotManager.readFileSnapshot("src/index.ts")!;
    for (const target of targets) {
      const renameLocationsToTarget = renameLocations!.filter(
        (loc) => expandPath(target) === expandPath(loc.fileName),
      );
      const newSymbolName = `${symbol.name}_${
        Math.random().toString(36).slice(2)
      }`;
      current = applyRenameLocations(
        current,
        newSymbolName,
        renameLocationsToTarget,
      );
    }
    snapshotManager.writeFileSnapshot("src/index.ts", current);
    const result = languageService.getSemanticDiagnostics("src/index.ts");
    console.log("post error", result.length);
    console.log(snapshotManager.readFileSnapshot("src/index.ts"));
  }
}

The second rewrite is redundant because I just copy-pasted it to save effort, but basically, it rewrites the code three times.

First, add a line for y that references x:

const x: number = "";
const y: number = x;

Rewrite the variable name of x with a random suffix:

const x_3fo8yrgzd8u: number = "";
const y: number = x_3fo8yrgzd8u;

It is correctly rewritten in both places.
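Applying several rename locations to one string is easy to get wrong, because each replacement shifts the offsets that follow it. As a small alternative sketch, independent of the Language Service, the spans can be applied from the end of the file toward the start so that no offset bookkeeping is needed. The Span type and replaceSpans function are hypothetical names for this example, and the spans are assumed not to overlap.

```typescript
// Hypothetical minimal shape; a ts.RenameLocation carries the same data in textSpan
type Span = { start: number; length: number };

function replaceSpans(
  code: string,
  spans: readonly Span[],
  replacement: string,
): string {
  // Sort a copy by descending start so earlier offsets stay valid while we edit
  const ordered = [...spans].sort((a, b) => b.start - a.start);
  let result = code;
  for (const { start, length } of ordered) {
    result = result.slice(0, start) + replacement + result.slice(start + length);
  }
  return result;
}

const original = "const x = 1;\nconst y = x;";
// The two occurrences of "x", deliberately given out of order
const renamed = replaceSpans(
  original,
  [{ start: 23, length: 1 }, { start: 6, length: 1 }],
  "x_1",
);
console.log(renamed);
```

Sorting first also means the result does not depend on the order in which the locations happen to be returned.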

Then, rewrite y as well. (At this point, a program regeneration was necessary to update the renameInfo.)

const x_3fo8yrgzd8u: number = "xxxx";

const y_o7708up7yh8: number = x_3fo8yrgzd8u;

It becomes like this.

Even after these operations, the local src/index.ts remains unchanged.

With this, I was able to achieve what I originally set out to do.

Reflections on Reading the Source Code

Since the TypeScript core was in the middle of migrating to ESM, there were some incomplete parts, but I feel like I've gained knowledge that will last a lifetime.

I think it's a treasure trove, especially regarding internal implementations like IncrementalParsing, the type checker, and cache inheritance during program updates.

Compiler code might seem difficult, but if you have a certain level of knowledge from using TypeScript, you can often push through using your domain knowledge as a user.

The biggest takeaway is that I can now see through to how the code I'm currently writing is represented internally within the type checker.

Tips for Reading the Actual Source Code

  • When you encounter a new type or concept, first check src/compiler/types.ts. Basic types are defined here, so you'll naturally find yourself jumping back to this file.
  • Even if you are familiar with the TS Compiler API, the actual implementation behavior may defy your expectations. In particular, interfaces marked with /** @internal */ comments exist in the implementation but are omitted from the public API, and these can be crucial.
    • SourceFile, Node, and Symbol especially contain a lot of internal metadata.
  • When a function named findXxx calls a function named findXxxWorker, and the call is wrapped in tracing... or Performance..., it's a signal that findXxxWorker is expensive to run. Brace yourself before reading it. Such functions also often involve recursion; in those cases, shift your focus back to the findXxx side.
  • When reading functions that generate large modules, reading from the top can be overwhelming. There's a trick:
    • The first half usually deals with various cache inheritances, while the latter half handles new generation. For functions like createProgram, you might want to start reading from the latter half.
    • While Program is the fundamental unit, reading the logic for inheriting from oldProgram was tough. It might be enough to have a rough understanding like "it's inheriting sourceFiles."
  • Read the code around TypeChecker's type inference assuming it primarily uses bitmasks. There aren't any difficult bitwise operations, but there are simply many steps involved.
  • The Parser is a daunting maze of hand-written recursive descent parsing. Unless you absolutely need to (like if you want to implement new syntax), you can skip it.
  • Most of the implementation is written using functions that hide internal state within their local scopes. There are almost no classes. Since state is hidden in function scopes, when reading functions that generate large units, check what variables exist in the local scope as you read.
  • The module resolution procedure implemented in src/compiler/moduleNameResolver.ts is somewhat intuitive for those who know the relationship between Node.js and ESM, but both the specification and implementation are extremely complex and difficult. I skipped through it, treating it as "just how it works."

End

I encourage everyone to try reading it! And here is one important tip: since it's a 10-year-old repository, git clone is extremely heavy. Make sure to use git clone --depth 1 --branch main!

Scraps from when I was reading:

https://zenn.dev/mizchi/scraps/3c30ea6fa9f8e5
https://zenn.dev/mizchi/scraps/80d47cbc601a5f
https://zenn.dev/mizchi/scraps/9f0f6b3b08bff6
