Simple Optical Tracking

DIY camera tracking of objects that are moving on the ground, with just a webcam and a few lines of code

The setup covered here allows you to track an object that moves on a known plane (e.g., even ground) using just one webcam and a few lines of Python code. The camera has to be mounted in an elevated position, observing the plane and the object you want to track.

The foundation for the tracking algorithm is the pinhole camera model. This model allows you to get the real-world position [X_W, Y_W, Z_W]^T of the object you want to track using pixel data from the image. For this, you need to identify three things:

  1. The image coordinates [X_I, Y_I]^T of the object
  2. The intrinsic parameters of the camera
  3. The rotation R_WC and translation t_WC from world coordinates to camera coordinates (see the sketch after this list):
    [X_C, Y_C, Z_C]^T = R_WC [X_W, Y_W, Z_W]^T + t_WC
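
To make the third point concrete, here is a minimal NumPy sketch of the world-to-camera transform. The rotation and translation values are made up purely for illustration (a camera looking straight down from 2 m above the world origin); they are not the result of any calibration.

```python
import numpy as np

# Assumed extrinsics for illustration only: a camera 2 m above the world
# origin, looking straight down (these are NOT calibrated values).
R_WC = np.array([[1.0,  0.0,  0.0],
                 [0.0, -1.0,  0.0],
                 [0.0,  0.0, -1.0]])
t_WC = np.array([0.0, 0.0, 2.0])

# A world point on the ground plane, half a meter from the world origin
P_W = np.array([0.5, 0.0, 0.0])

# World -> camera transform: P_C = R_WC @ P_W + t_WC
P_C = R_WC @ P_W + t_WC
print(P_C)  # -> [0.5, 0.0, 2.0]: the point lies 2 m in front of the camera
```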

Intrinsic Parameters

The first step of determining the image coordinates (the actual pixel positions of the object) can be done in various ways. You can use segmentation or object detection algorithms, for example. Choosing a suitable method here depends on the situation and is beyond the scope of this post. For simplicity's sake, let's just assume that the image coordinates are known already.

This brings us to the next step: the intrinsic parameters of the camera. In a perfect world, the camera we use behaves like a pinhole camera. In such a scenario, you could just draw a light ray from each pixel in the image, through the pinhole, pointing exactly to the corresponding object in the real world.

Olaf Peters (OlafTheScientist), CC BY-SA 4.0, via Wikimedia Commons

This means that the image coordinates are connected directly to the real-world object by the projection rays. The origin of the camera coordinate system is the center of projection O (a.k.a. the pinhole). Computing the 3D reconstruction from image coordinates means following those rays from the known pixel position, through the pinhole, until you reach the object. The geometry behind this is easier to understand if we first follow a ray in the opposite direction (from known world coordinates to the pixel position) and then turn that process upside down. Let's say the world coordinates are known and have been transformed into the camera coordinate system. To make everything easier, those transformed coordinates are also scaled down so that they lie exactly one unit away from the center of projection:

  • X̂_C = X_C / Z_C
  • Ŷ_C = Y_C / Z_C
  • Ẑ_C = Z_C / Z_C = 1

These scaled-down coordinates (marked by the hat symbol) can be projected into the pixel coordinate system using the principal point [c_x, c_y] and the focal lengths [f_x, f_y]. For this, the parameters are placed inside the camera matrix K, which projects the normalized coordinates [X̂_C, Ŷ_C, 1]^T onto the image plane to get the actual pixel positions.

  • K = [f_x, 0, c_x; 0, f_y, c_y; 0, 0, 1]
  • [X_I, Y_I, 1]^T = K [X̂_C, Ŷ_C, 1]^T
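
As a quick sanity check, this projection can be written in a few lines of NumPy. The focal lengths and principal point below are assumed placeholder values, not calibrated ones:

```python
import numpy as np

# Placeholder intrinsics in pixels (assumed values, not a real calibration)
fx, fy = 800.0, 800.0
cx, cy = 320.0, 240.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# A point in camera coordinates, e.g. the result of the transform above
P_C = np.array([0.5, 0.0, 2.0])

# Normalize by Z_C so the point lies on the plane Z_C = 1
P_C_hat = P_C / P_C[2]

# Project onto the image plane with the camera matrix
p_I = K @ P_C_hat
print(p_I[:2])  # -> [520. 240.]: the pixel coordinates [X_I, Y_I]
```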

All the equations above assume that the camera in use follows the pinhole camera model. However, this model is idealized and doesn't quite represent reality. Real cameras use lenses instead of a pinhole. Those lenses essentially do the same thing but add some distortion to the image, so the pixel/image coordinates aren't quite where they would be with an ideal pinhole.

Determining the lens distortion parameters as well as the camera matrix can be done with openly accessible libraries like OpenCV, or with tools like the MATLAB Camera Calibrator. The details of that process are documented there and aren't repeated here for that reason.
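
For reference, a typical OpenCV calibration run looks roughly like the sketch below. It assumes a set of chessboard photos in a folder called calibration_images/ and a 9x6 inner-corner pattern with 25 mm squares; adjust both to your own setup.

```python
import glob
import cv2
import numpy as np

# Assumed chessboard target: 9x6 inner corners, 25 mm squares
pattern_size = (9, 6)
square_size = 0.025  # meters

# 3D corner template: the board defines its own plane, so Z = 0 for all corners
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)
objp *= square_size

obj_points, img_points = [], []
for fname in glob.glob("calibration_images/*.jpg"):  # assumed image folder
    gray = cv2.cvtColor(cv2.imread(fname), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Returns the camera matrix K and the distortion parameters d used later on
ret, K, d, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection error:", ret)
```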

Rotation and Translation from World to Camera Coordinates

The third thing you need to determine for the 3D reconstruction is the relation between the camera and the world coordinate system. It is expressed using the rotation R_WC and the translation t_WC. These parameters are also called extrinsic parameters and can be determined by solving a perspective-n-point (PnP) problem. This requires at least 4 markers placed on the plane you want to observe with the camera. The markers must not be collinear, and you have to know their exact positions in world coordinates. Once you have placed the markers and know their exact locations, you can take a picture with your camera and determine the corresponding points in image coordinates.


Let's say we have the following corresponding point pairs:

Real-World Point                          Image Point
P_W^(0) = [X_W^(0), Y_W^(0), Z_W^(0)]     P_I^(0) = [X_I^(0), Y_I^(0)]
P_W^(1) = [X_W^(1), Y_W^(1), Z_W^(1)]     P_I^(1) = [X_I^(1), Y_I^(1)]
P_W^(2) = [X_W^(2), Y_W^(2), Z_W^(2)]     P_I^(2) = [X_I^(2), Y_I^(2)]
P_W^(3) = [X_W^(3), Y_W^(3), Z_W^(3)]     P_I^(3) = [X_I^(3), Y_I^(3)]

With those corresponding points, the camera matrix K, and the distortion parameters d, you can now use OpenCV's solvePnP(...) function to get the rotation matrix R_WC and the translation vector t_WC.

  • solvePnP([P_W^(0), …, P_W^(3)], [P_I^(0), …, P_I^(3)], K, d) → [R_WC, t_WC]
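
In code, this step might look like the following sketch. The marker coordinates are placeholder values, and K and d would normally come from the calibration step (placeholders are used here so the snippet runs on its own). Note that solvePnP returns a rotation *vector*, which cv2.Rodrigues converts into the rotation matrix R_WC.

```python
import cv2
import numpy as np

# World coordinates of the four markers on the plane (placeholder values, meters)
object_points = np.array([[0.0, 0.0, 0.0],
                          [1.0, 0.0, 0.0],
                          [1.0, 1.0, 0.0],
                          [0.0, 1.0, 0.0]], dtype=np.float64)

# Corresponding pixel positions picked from the camera image (placeholder values)
image_points = np.array([[152.0, 401.0],
                         [487.0, 395.0],
                         [463.0, 158.0],
                         [171.0, 162.0]], dtype=np.float64)

# Intrinsics from the calibration step (placeholder values here)
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
d = np.zeros(5)  # assume negligible distortion for this sketch

# solvePnP yields a rotation vector; Rodrigues turns it into the 3x3 matrix R_WC
ok, rvec, t_WC = cv2.solvePnP(object_points, image_points, K, d)
R_WC, _ = cv2.Rodrigues(rvec)
```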

Putting Everything Together

We determined the camera matrix K and the distortion parameters d by calibrating the camera. Using at least 4 known real-world markers, we also determined the extrinsic parameters R_WC and t_WC using solvePnP(...). With all this information, it's now possible to transform any image point [X_I, Y_I] into the world coordinate system [X_W, Y_W, Z_W], as long as Z_W is known. Because this whole tracking setup is constrained to objects moving on a plane, the Z_W value can simply be measured beforehand: it is the height of the tracked marker above the plane defined by the four fixed markers. The first step in this process is turning the image coordinates into normalized camera coordinates using the inverse camera matrix K^-1.

  • [X̂_C, Ŷ_C, 1]^T = K^-1 [X_I, Y_I, 1]^T

The next step is scaling the normalized coordinates by Z_C to get the actual point in the camera coordinate system.

  • [X_C, Y_C, Z_C]^T = [X̂_C, Ŷ_C, 1]^T · Z_C

Unfortunately, we don't know the value of Z_C. All we know is Z_W, so we need to calculate Z_C first. To do that, we can use the relationship between world and camera coordinates, which is described by the extrinsic parameters R_WC and t_WC.

  • R_WC = [r_00, r_01, r_02; r_10, r_11, r_12; r_20, r_21, r_22]
  • t_WC = [x_t, y_t, z_t]^T
  • [X_C, Y_C, Z_C]^T = R_WC [X_W, Y_W, Z_W]^T + t_WC

Let's first turn this equation around so that the world coordinates are alone on one side. Since R_WC is a rotation matrix, its inverse is simply its transpose, and the translation is undone by subtracting t_WC before rotating back:

  • [X_W, Y_W, Z_W]^T = (R_WC)^T ([X_C, Y_C, Z_C]^T - t_WC)

Now replace the camera coordinates with the scaled normalized camera coordinates:

  • [X_W, Y_W, Z_W]^T = (R_WC)^T ([X̂_C, Ŷ_C, 1]^T · Z_C - t_WC)

The desired result can now be calculated by simplifying the equation and looking only at the Z-component of the 3D coordinates:

  • r_2 = [r_02, r_12, r_22] (the last row of the inverse rotation matrix (R_WC)^T)
  • Z_W = r_2 · ([X̂_C, Ŷ_C, 1]^T · Z_C - t_WC)
  • Z_W = r_2 · [X̂_C, Ŷ_C, 1]^T · Z_C - r_2 · t_WC
  • Z_C = (Z_W + r_2 · t_WC) / (r_2 · [X̂_C, Ŷ_C, 1]^T)

Now that the value of Z_C is known, we can go back to the previous equation and calculate the position in the world coordinate system.
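
Putting the last few equations into code, a back-projection helper could look like the sketch below. The function name image_to_world is just an illustrative choice; it assumes K, d, R_WC, and t_WC from the previous steps and uses cv2.undistortPoints, which removes the lens distortion and applies K^-1 in one step.

```python
import cv2
import numpy as np

def image_to_world(p_I, Z_W, K, d, R_WC, t_WC):
    """Back-project the image point p_I onto the horizontal plane at height Z_W."""
    # undistortPoints returns the normalized camera coordinates [X̂_C, Ŷ_C]
    p = np.asarray(p_I, dtype=np.float64).reshape(1, 1, 2)
    x_hat, y_hat = cv2.undistortPoints(p, K, d).ravel()
    ray = np.array([x_hat, y_hat, 1.0])

    # Last row of the inverse rotation matrix (R_WC)^T
    r2 = R_WC.T[2]
    t = np.asarray(t_WC, dtype=np.float64).ravel()

    # Solve Z_W = r2 · (ray * Z_C - t) for the unknown depth Z_C
    Z_C = (Z_W + r2 @ t) / (r2 @ ray)

    # Scale the ray to camera coordinates and transform back into world coordinates
    P_C = ray * Z_C
    return R_WC.T @ (P_C - t)

# Example call: the tracked point sits directly on the marker plane (Z_W = 0)
# P_W = image_to_world([412.0, 255.0], 0.0, K, d, R_WC, t_WC)
```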

Example Implementation

I've also created a Jupyter Notebook example that shows how to actually code everything that was described above: