From Camera to Clip Space - Derivation of the Projection Matrices
We derive the orthographic and perspective projection matrices used in 3D computer graphics, with special emphasis on OpenGL conventions. Starting from camera-space geometry, we construct the projection that maps to clip space, explain clipping via the clip-space inequalities, and show how the subsequent perspective divide produces normalized device coordinates in the canonical view volume. Along the way, we show that affine transformations preserve parallelism. We clarify the roles of the $z$- and $w$-components: Depth in NDC arises from $z_c/w_c$, with $w_c$ carrying the perspective scaling. The result is a compact, implementation-oriented account of the field-of-view, aspect-ratio and near/far-plane parameters.
Introduction
This article describes one of the final steps in the rendering pipeline: Projection and perspective transformations.
While in perspective projection all projection lines converge at a single point - producing the familiar sense of depth - in orthographic projection these lines run parallel to each other. With this projection type, parallel lines therefore remain parallel even after mapping onto the so-called unit cube.
We begin with the orthographic projection and derive a projection matrix that maps an arbitrary view volume onto the so-called unit cube. Along the way, we use figures to show how an orthographic projection affects objects we usually perceive in a perspective manner.
We then derive the perspective projection and show how choosing a near and a far viewing plane, as well as a height and a width encoded in the aspect ratio, creates a so-called view frustum that contains the objects visible to the observer, while any geometry outside of it is clipped and discarded from the rendering process.
Projection Planes
In the final steps of the rendering pipeline, the perspective divide converts clip-space coordinates to normalized device coordinates; the viewport transform maps them to screen space, and rasterization produces the 2D image.
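To make these last steps concrete, here is a small sketch (the function name, struct and viewport values are illustrative, not taken from this article) of the viewport transform: it maps an NDC position with components in $[-1, 1]$ to window coordinates, in the way glViewport and glDepthRange parameterize it in OpenGL.

```cpp
#include <cstdio>

struct Vec3 { float x, y, z; };

// Maps an NDC position (components in [-1, 1]) to window coordinates,
// mirroring what glViewport (origin/size) and glDepthRange (near/far) set up.
Vec3 ndcToWindow(Vec3 ndc, float vpX, float vpY, float vpW, float vpH,
                 float depthNear, float depthFar) {
    return {
        vpX + (ndc.x * 0.5f + 0.5f) * vpW,                          // [-1,1] -> [vpX, vpX + vpW]
        vpY + (ndc.y * 0.5f + 0.5f) * vpH,                          // [-1,1] -> [vpY, vpY + vpH]
        depthNear + (ndc.z * 0.5f + 0.5f) * (depthFar - depthNear)  // [-1,1] -> depth range
    };
}

int main() {
    // The center of the canonical view volume ends up at the viewport center.
    Vec3 win = ndcToWindow({0.0f, 0.0f, 0.0f}, 0, 0, 800, 600, 0.0f, 1.0f);
    std::printf("%.1f %.1f %.2f\n", win.x, win.y, win.z); // prints: 400.0 300.0 0.50
    return 0;
}
```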
Before any of this happens, a view matrix is constructed that transforms world coordinates into camera coordinates, defined by a vantage point (the eye), an observed point (the look-at target) and a vector giving the camera's up direction, for a camera placed at the eye and looking at the target.
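As a rough, self-contained sketch of that construction - in the spirit of gluLookAt, with helper names chosen for this example rather than taken from the article - the view matrix can be assembled as follows:

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3  sub(Vec3 a, Vec3 b)   { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3  cross(Vec3 a, Vec3 b) { return {a.y * b.z - a.z * b.y,
                                             a.z * b.x - a.x * b.z,
                                             a.x * b.y - a.y * b.x}; }
static float dot(Vec3 a, Vec3 b)   { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3  normalize(Vec3 v)     { float l = std::sqrt(dot(v, v)); return {v.x / l, v.y / l, v.z / l}; }

// Builds a column-major 4x4 view matrix: the camera sits at 'eye',
// looks toward 'center', and 'up' fixes the roll about the view direction.
void lookAt(Vec3 eye, Vec3 center, Vec3 up, float m[16]) {
    Vec3 f = normalize(sub(center, eye));   // forward, becomes -z in camera space
    Vec3 s = normalize(cross(f, up));       // right,   becomes +x
    Vec3 u = cross(s, f);                   // true up, becomes +y
    float r[16] = {
        s.x, u.x, -f.x, 0.0f,                            // column 0
        s.y, u.y, -f.y, 0.0f,                            // column 1
        s.z, u.z, -f.z, 0.0f,                            // column 2
        -dot(s, eye), -dot(u, eye), dot(f, eye), 1.0f    // column 3: moves 'eye' to the origin
    };
    for (int i = 0; i < 16; ++i) m[i] = r[i];
}

int main() {
    float view[16];
    lookAt({0, 0, 5}, {0, 0, 0}, {0, 1, 0}, view);  // camera 5 units back on +z, looking at the origin
    // 'view' now holds the world-to-camera transform in column-major order.
    return 0;
}
```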
Transforming world coordinates into camera coordinates isn’t particularly helpful on its own. In a scene graph with hundreds of nodes, we still have to decide which ones should actually appear on screen. Even with a known viewer position and look direction, the camera is still just an idealized point; without a defined viewing volume, no geometry will be captured, and the resulting image will be empty. Conversely, knowing only the viewport's height and width defines the aspect ratio, but it tells us nothing about the field of view, depth range, or clipping planes.
In general, the camera implicit in computing camera coordinates is a pinhole camera - an idealized model with an infinitesimally small aperture that admits rays and produces an image. We push the abstraction a step further: In computer graphics we don't capture photons; instead, we trace geometric lines from the scene's vertices through the pinhole onto a projection plane. Take the camera obscura as a figurative example, where we place a screen behind the pinhole and the projected points fall onto that plane, yielding a mirrored, upside-down image (see Figure 1).
In the following, we will define the viewing volume for both orthographic and perspective projections. We will determine the size of the projection plane and its distance from the camera, and examine how these parameters define the view frustum, which specifies which objects in the scene are ultimately rendered.

Excursus: View Space, Clip Space, the Canonical View Volume and NDCs
In the following, we will give a brief introduction to clipping, the process that discards geometry outside of a view volume. In short, primitives removed during clipping will not be part of the canonical view volume, which contains all the geometry passed on to the final rasterization and on-screen display stages of the rendering pipeline.
OpenGL performs clipping with the help of the homogeneous coordinate component $w$. The following introduces the process using orthographic projection, but it applies analogously to perspective projection, as we will see later.
In camera space, the six faces of the view volume define the clipping planes (see Figure 2). Roughly speaking, all vertices within this view volume are kept for rendering.
After multiplying a vertex $v$ with the projection matrix $M_{proj}$, the vertex shader's homogeneous output coordinate

gl_Position $= M_{proj} \cdot v = (x_c,\, y_c,\, z_c,\, w_c)^T$

is in clip space.
OpenGL then performs clipping of the primitives against the inequalities [📖KSS17, Figure 5.2, 201]

$$-w_c \le x_c \le w_c, \qquad -w_c \le y_c \le w_c, \qquad -w_c \le z_c \le w_c.$$
Primitives that lie entirely outside this volume are discarded, while primitives that intersect its boundary are clipped so that only their visible portion is rasterized [📖SWH15, 2 96ff.].
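As an illustrative sketch (this is not the actual OpenGL implementation, which clips whole primitives against the volume's boundary rather than testing isolated vertices), a clip-space vertex satisfies the inequalities above exactly when a test like the following succeeds; the struct and function names are made up for this example:

```cpp
struct Vec4 { float x, y, z, w; };

// True if a clip-space vertex lies inside the clip volume -w <= x, y, z <= w.
bool insideClipVolume(const Vec4& v) {
    return -v.w <= v.x && v.x <= v.w &&
           -v.w <= v.y && v.y <= v.w &&
           -v.w <= v.z && v.z <= v.w;
}
```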
Once clipping has finished, the clip coordinates are mapped to Normalized Device Coordinates by the perspective divide

$$(x_{ndc},\, y_{ndc},\, z_{ndc}) = \left(\frac{x_c}{w_c},\, \frac{y_c}{w_c},\, \frac{z_c}{w_c}\right).$$
This yields the coordinates within the canonical view volume. Since every surviving clip coordinate satisfies the clipping inequalities (e.g., $-w_c \le x_c \le w_c$), every component of the resulting NDC coordinate is guaranteed to lie in the interval $[-1, 1]$.
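A minimal sketch of this step, with made-up names and an arbitrary example value:

```cpp
#include <cstdio>

struct Vec4 { float x, y, z, w; };
struct Vec3 { float x, y, z; };

// Perspective divide: clip coordinates -> normalized device coordinates.
Vec3 perspectiveDivide(Vec4 clip) {
    return { clip.x / clip.w, clip.y / clip.w, clip.z / clip.w };
}

int main() {
    // A clip coordinate that satisfies -w <= x, y, z <= w ...
    Vec4 clip{2.0f, -1.0f, 3.0f, 4.0f};
    Vec3 ndc = perspectiveDivide(clip);
    // ... ends up with every component in [-1, 1]:
    std::printf("%.2f %.2f %.2f\n", ndc.x, ndc.y, ndc.z); // prints: 0.50 -0.25 0.75
    return 0;
}
```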
The Canonical Unit Cube, or canonical view volume [📖RTR, 94], defined by the interval $[-1, 1]$ on all three axes, is the space for Normalized Device Coordinates in OpenGL.
In the rendering pipeline, NDCs are produced by applying the perspective divide to the clip-space coordinates that result from the projection transform. With orthographic projection, it follows that $w_c = 1$. The coordinates resulting from this normalization step are then mapped to the user's specific device resolution (see [📖LGK23, Fig. 5.36, 181]).
When using orthographic projection, clip space therefore coincides with the NDCs: the perspective division by the fourth homogeneous coordinate $w_c$ is still applied, but with $w_c = 1$ it does not change the $x$-, $y$- and $z$-coordinates (see [📖SWH15, 72]).
Let $p = (x, y, z)^T$ be a point in view space. In homogeneous coordinates, this is $p = (x, y, z, 1)^T$.
After transforming into clip space, the point is mapped to

$$p_{clip} = M_{proj} \cdot p = (x_c,\, y_c,\, z_c,\, w_c)^T.$$
Let