Structure from Motion with OpenCV

Using OpenCV and other software frameworks to implement structure from motion: reconstructing the camera trajectory as well as the environment from multiple frames of a moving camera.

The idea of structure from motion is to use multiple frames of a moving camera to reconstruct the camera trajectory as well as the environment. It is difficult to get an accurate model of the environment from a single picture. However, if you have two or more pictures observing the same scene from different locations, you can triangulate corresponding points in the images.

Arne Nordmann (norro), CC BY-SA 3.0, via Wikimedia Commons

You can, for instance, figure out the position of a point \(X\) if you have the following information:

  • The position and orientation of the camera while taking each of the two or more images
  • The positions of the point \(X\) in both images (\(x_L\) and \(x_R\))

If you have that information, you can draw rays that start at the camera origins \(O_L\) and \(O_R\) and go through \(x_L\) and \(x_R\), respectively. The actual position of \(X\) can then be determined by figuring out where the two rays intersect.

In the structure-from-motion approach of this post, the only information you can work with, however, is a series of frames that show the same thing from different points of view. The idea here is to find enough corresponding points in each of the pictures to reconstruct the camera positions and orientations. Once those are obtained, the 3D position of any pair of corresponding points can be reconstructed.

Point correspondences

Let's solve the point correspondence first. In general, there are two approaches to solving this problem. With dense matching, you look at every pixel in one image and try to find the corresponding pixel in the other image. This approach produces a result that can later be turned into a dense cloud of reconstructed points. It assumes that every pixel in one image can be seen from the other image as well. Since the two pictures are taken from different positions, however, there will usually be occluded areas that can only be observed from one point of view. This means that you not only have to find a correspondence for each pixel, but also have to deal with pixels that are impossible to match.
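To give an idea of what dense matching looks like in practice, here is a minimal sketch using OpenCV's semi-global block matching. It assumes a rectified stereo pair (leftGray and rightGray are hypothetical grayscale images), which is a simpler setting than the free camera motion considered in this post.

// dense matching on a rectified image pair: every pixel of the left image
// gets a disparity value, i.e. the horizontal offset of its correspondence
cv::Ptr<cv::StereoSGBM> stereo = cv::StereoSGBM::create(0, 64, 5);
cv::Mat disparity;
stereo->compute(leftGray, rightGray, disparity); // CV_16S, fixed-point disparities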

Another approach to this is sparse matching. Instead of trying to model a correspondence for each pixel, you try to find a set of points that are easily identifiable in both pictures and match them accordingly. OpenCV offers several algorithms for identifying such points in an image, among them SIFT, SURF, and ORB. An interesting comparison between these three can be found here. With one of these methods, you can extract key points from every frame independently.

// load image and convert it to grayscale
cv::Mat image = cv::imread("path/to/image.jpg", cv::IMREAD_COLOR);
cv::Mat grayscaleImg;
cv::cvtColor(image, grayscaleImg, cv::COLOR_BGR2GRAY);

// initialize the key point algorithm
// (default parameters; settings such as the Hessian threshold can be tuned here)
cv::Ptr<cv::xfeatures2d::SURF> detector = cv::xfeatures2d::SURF::create();

// determine key points and a feature vector (descriptor) for each key point
cv::Mat descriptors;
std::vector<cv::KeyPoint> keyPoints;
detector->detectAndCompute(grayscaleImg, cv::noArray(), keyPoints, descriptors);

After determining easily identifiable points in the image, it's time to find point correspondences in image pairs. For this, I will use a k-nearest-neighbors approach that tries to find the two best matches for each key point.

// initialize the matching algorithm
cv::Ptr<cv::DescriptorMatcher> matcher = cv::DescriptorMatcher::create(cv::DescriptorMatcher::FLANNBASED);

// find the two best matches in descriptors2 for each entry in descriptors1
std::vector<std::vector<cv::DMatch>> knnFeatureMatches;
matcher->knnMatch(descriptors1, descriptors2, knnFeatureMatches, 2);

Instead of just picking the most probable match, this also allows for a quality assessment of the match. David Lowe proposed a simple ratio test that looks at these two matches and only considers the best one valid if the second-best match is sufficiently worse. If both matches are similarly close, the risk of choosing the wrong one is too high, and the match is discarded.

std::vector<cv::Point2f> inliers1, inliers2;
for(std::size_t i = 0; i < knnFeatureMatches.size(); ++i) {
    // Lowe's ratio test: keep the best match only if it is clearly
    // better than the second-best one
    if(knnFeatureMatches[i][0].distance < 0.8 * knnFeatureMatches[i][1].distance) {
        const cv::DMatch& m = knnFeatureMatches[i][0];
        inliers1.push_back(keyPoints1[m.queryIdx].pt);
        inliers2.push_back(keyPoints2[m.trainIdx].pt);
    }
}

This pairwise matching procedure can be done with every subsequent frame and one base image (the first frame in the series). In the example video below, you can see a side-by-side comparison between such a base frame and all subsequent frames. The selected key points (inliers) are drawn as circles and matches are connected via white lines.
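Such a side-by-side visualization can be produced with cv::drawMatches. The sketch below assumes the two frames are available as image1 and image2 with their key points keyPoints1 and keyPoints2, and first collects the matches that survived the ratio test into a flat list.

// collect the matches that passed the ratio test into a flat list
std::vector<cv::DMatch> goodMatches;
for(const auto& knn : knnFeatureMatches) {
    if(knn[0].distance < 0.8 * knn[1].distance) {
        goodMatches.push_back(knn[0]);
    }
}

// draw both frames next to each other, with circles for the key points
// and lines connecting the matched pairs
cv::Mat sideBySide;
cv::drawMatches(image1, keyPoints1, image2, keyPoints2, goodMatches, sideBySide);
cv::imshow("matches", sideBySide);
cv::waitKey(0);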


Epipolar Geometry

Cameras can be thought of as direction measurement devices. If you observe a point in a picture, you can't really tell its exact position without additional knowledge. You can, however, draw a ray from the camera's origin to that point and know its direction. If you have two pictures with enough point correspondences, you can use those correspondences and the directions of the rays to recover the relative orientation between the first and the second camera perspective.
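As a small illustration of this "direction measurement" idea, a pixel can be back-projected into a viewing ray using the inverse of the camera matrix. The names cameraMatrix, u, and v in this sketch are assumptions; the camera matrix itself only appears in the code further below.

// back-project the observed pixel (u, v) into a ray direction
// expressed in the camera coordinate frame
cv::Mat pixel = (cv::Mat_<double>(3, 1) << u, v, 1.0);
cv::Mat rayDirection = cameraMatrix.inv() * pixel;
rayDirection /= cv::norm(rayDirection); // normalize to unit length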

As can be seen in the picture at the beginning of this post, every point \(X\) that is visible from both perspectives forms a triangle \(\overline{O_L X O_R}\) with the two camera origins. This relationship can be exploited even if the position of \(X\) and the relative orientation between the two perspectives are unknown. The fact that the vector \(\overrightarrow{O_LX}\) (and therefore also the vector \(\overrightarrow{O_Lx_L}\)), the vector \(\overrightarrow{O_LO_R}\), and the vector \(\overrightarrow{O_RX}\) (and therefore also \(\overrightarrow{O_Rx_R}\)) all lie in a common plane can be expressed as a system of linear equations:

  • \(x_L^\top E \, x_R = 0\)

Where \(x_L\) and \(x_R\) are homogeneous coordinates (the \(x\)- and \(y\)-coordinates describe the position in the image plane and the \(z\)-coordinate is \(1\)). The coplanarity of these points and the camera origins is expressed here by the \(3 \times 3\) matrix \(E\), which is called the essential matrix. As long as there are enough point correspondences between the two pictures, the matrix \(E\) can be determined by solving this system of linear equations.

cv::Mat inlierMask;
cv::Mat essentialMat = cv::findEssentialMat(inliers1, inliers2, cameraMatrix, cv::RANSAC, 0.99, 1, 500, inlierMask);

The OpenCV function cv::findEssentialMat() does even more than just solve a system of equations. Using the point correspondences from earlier, it runs RANSAC to find the largest set of point pairs that are consistent with a single essential matrix and marks all other pairs as outliers. As a side effect, any invalid matches from earlier are filtered out as well. The inlierMask matrix has one entry per point pair and marks every valid pair with a non-zero value.
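If you want to continue with only the consistent point pairs, the mask can be applied explicitly, as in the following sketch (cv::recoverPose below can also consume the mask directly, so this step is optional):

// keep only the point pairs that were marked as inliers by RANSAC
std::vector<cv::Point2f> maskedInliers1, maskedInliers2;
for(std::size_t i = 0; i < inliers1.size(); ++i) {
    if(inlierMask.at<unsigned char>(i)) {
        maskedInliers1.push_back(inliers1[i]);
        maskedInliers2.push_back(inliers2[i]);
    }
}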

Pose Recovery

Once the essential matrix of an image pair is obtained, it can be used to recover the relative pose between the two views:

cv::Mat rotation, translation;
int numValidPoints = cv::recoverPose(essentialMat, inliers1, inliers2, cameraMatrix, rotation, translation, inlierMask);

The result of this operation is a rotation matrix and a translation vector that represent the transformation from the view that provided inliers1 to the view that provided inliers2. It is important to note, however, that the scale of that transformation is unknown. Without additional knowledge, the cameras could be far apart, observing a huge landscape, or close together, observing a miniature one. The rotation matrix stays the same in both scenarios, but the length of the translation vector would change. For this reason, OpenCV returns a unit-length translation vector and leaves it to the developer to scale it appropriately.
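With the recovered pose, the triangulation promised at the beginning of the post becomes possible. The following is only a sketch: it assumes cameraMatrix is a 3x3 CV_64F matrix, works best if the RANSAC outliers have been filtered out first (e.g. with the mask as shown above), and the triangulated points inherit the unknown scale of the translation vector.

// the first view serves as the reference frame, the second view is placed
// using the recovered rotation and translation
cv::Mat P1 = cameraMatrix * cv::Mat::eye(3, 4, CV_64F);
cv::Mat pose2(3, 4, CV_64F);
rotation.copyTo(pose2(cv::Rect(0, 0, 3, 3)));
translation.copyTo(pose2(cv::Rect(3, 0, 1, 3)));
cv::Mat P2 = cameraMatrix * pose2;

// triangulate the matched points (result: 4xN homogeneous coordinates) ...
cv::Mat pointsHomogeneous;
cv::triangulatePoints(P1, P2, inliers1, inliers2, pointsHomogeneous);

// ... and convert them into regular 3D points
cv::Mat points3d;
cv::convertPointsFromHomogeneous(pointsHomogeneous.t(), points3d);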

In the video above, you can see pictures from the publicly available Tsukuba Stereo Dataset. This dataset is ideal for demonstrating this structure-from-motion approach because it is completely synthetic. In a real-world scenario, the scaling factor has to be determined by using fixed markers in the scenery with known positions. In this dataset, however, the camera parameters (the camera matrix as well as the actual camera pose) are known and can be used to scale the translation vectors.

// scale the unit-length translation with the length of the ground-truth translation
double scale = cv::norm(groundtruth.translation());
translation *= scale;

Doing this for all the frames shown in the video above results in the following path:

The white line shows the actual trajectory while the red line represents the recovered poses (scaled with the ground truth translations). This result was obtained using the method above on the first 90 frames of the dataset. The 3D coordinate system shows the position of the first frame and has axes that are 10 cm long.
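One way to turn each recovered (and scaled) pose into a point on such a trajectory is to compute the camera position in the reference frame: since the rotation and translation map points from the reference frame into the current camera frame, the camera position is \(-R^\top t\). A minimal sketch, with trajectory as a hypothetical container for the positions:

// collected camera positions, one per frame
std::vector<cv::Point3d> trajectory;

// the recovered pose maps points from the reference frame into the current
// camera frame, so the camera position in the reference frame is -R^T * t
cv::Mat cameraPosition = -rotation.t() * translation;
trajectory.emplace_back(cameraPosition.at<double>(0),
                        cameraPosition.at<double>(1),
                        cameraPosition.at<double>(2));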

Bundle Adjustment

The results are not very accurate and get worse the further apart the matched camera views are. This is caused by numerical inaccuracies and imprecise key point positions. Instead of just estimating the transformations from the first frame to each of the subsequent frames, the whole data set can be viewed as a global optimization problem. Key points that were used in the first image pair are probably also visible in subsequent frames. By estimating the 3D locations of those points and optimizing the camera poses of all views simultaneously, the results can be improved significantly:

  • \(\min_{T, p} \; \frac{1}{2} \sum_i \sum_j \left\| z_{ij} - h(T_i, p_j) \right\|^2\)

This formula describes the problem as a least-squares optimization. The idea is to iterate over all images (index \(i\)) and all 3D points \(p_j\) and to adjust the camera poses \(T_i\) (and the point positions \(p_j\)) such that the reprojected image points \(h(T_i, p_j)\) are as close as possible to the actually observed image points \(z_{ij}\). OpenCV offers a structure-from-motion module that allows developers to run such an optimization using the function cv::sfm::reconstruct(). Unfortunately, it is not considered core functionality and is still in early development. For this reason, it usually isn't included by default, which means that OpenCV has to be compiled manually to be able to use it.
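To make the formula a bit more concrete: with an optimization framework such as Ceres (mentioned below), the term inside the sums can be written as a small residual functor. This is only a rough sketch under several assumptions: the pose \(T_i\) is parameterized as angle-axis rotation plus translation, and fx, fy, cx, cy stand for the entries of the camera matrix.

#include <ceres/ceres.h>
#include <ceres/rotation.h>

// Residual for one observation z_ij: the difference between the observed
// image point and the reprojection h(T_i, p_j) of the 3D point p_j
struct ReprojectionError {
    ReprojectionError(double observedX, double observedY,
                      double fx, double fy, double cx, double cy)
        : observedX(observedX), observedY(observedY), fx(fx), fy(fy), cx(cx), cy(cy) {}

    template <typename T>
    bool operator()(const T* const pose,   // [0..2] angle-axis rotation, [3..5] translation
                    const T* const point,  // 3D point p_j
                    T* residuals) const {
        // transform the point into the camera frame of pose T_i
        T p[3];
        ceres::AngleAxisRotatePoint(pose, point, p);
        p[0] += pose[3]; p[1] += pose[4]; p[2] += pose[5];

        // pinhole projection h(T_i, p_j)
        const T u = T(fx) * p[0] / p[2] + T(cx);
        const T v = T(fy) * p[1] / p[2] + T(cy);

        residuals[0] = u - T(observedX);
        residuals[1] = v - T(observedY);
        return true;
    }

    double observedX, observedY, fx, fy, cx, cy;
};

// For every observation z_ij, one residual block links pose i and point j:
// problem.AddResidualBlock(
//     new ceres::AutoDiffCostFunction<ReprojectionError, 2, 6, 3>(
//         new ReprojectionError(z.x, z.y, fx, fy, cx, cy)),
//     nullptr, poses[i], points[j]);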

A more practical approach is to use dedicated frameworks for this task. The least-squares problem can be formulated manually using an optimization framework like Ceres. Alternatively, a specialized framework like OpenMVG can handle all the structure-from-motion steps internally. There are also freely available software solutions with graphical user interfaces that perform bundle adjustment for a given set of pictures. The most popular free ones seem to be COLMAP and Meshroom. In the screenshot below, you can see the results that can be achieved on the same data set with COLMAP without any parameter tuning.

In this screenshot, you can not only see an improved camera trajectory (the red markers) but also all the key points in 3D space that were used for the adjustment process. With parameter tuning and additional pictures, the result can be improved even further which highlights the advantage of the global optimization approach.