In the coming days I will talk about model reconstruction from images. I have developed a very simple framework for doing reconstruction. The framework consists of feature detection, correspondences, removing of incorrect data, distortion removal, computing the projection matrices, triangulation, and mesh creation. The discussion will involve a simplified problem space. Proofs for methods are left out of the explanation as the project’s scope is to perform the operations rather than the theory that goes behind it.The rig consists of a free moving calibrated camera at a single focus.
Image Capture
![]() |
| Images taken with web cam |
First step is to capture images. The camera must have the same focus for all images, including the images used for camera calibration. For simplicity, the Microsoft LifeCam NX-6000 web cam was used as it has a single focal length. A regular compact camera would give the best results. Compact cameras have a better image sensor (CCD) and lenses rather lower quality image sensor (CMOS) and lenses found in web cams.
Feature Detection
A good feature detector is to create a consistent feature point amongst different images. Characteristics of a good detector are invariant to changes to scale, noise, illumination, and affine (scale and translation) transform. Feature point detection is primarily done with the algorithm called Scale Invariant Feature Transform, or SIFT. Rather than re-implementing available solutions, the project used the VLFeat library. SIFT is invariant to scale, orientation, affine distortion, partially invariant to illumination changes [ref].
Correspondences
![]() |
| Correspondences between two images, 220 points |
Although SIFT is a fairly accurate algorithm, it isn’t perfect. Poor quality features can occur in parts of the image that the color data rapidly changes (i.e., a tree’s leaves), parts were color data doesn’t change (i.e., a flat solid area), highly saturated or under-saturated areas (i.e., lights, darkness), or highly specular or reflective surfaces (i.e. chrome, mirrors). Additionally, there can be bad correspondences from mismatching feature points.
Removing Incorrect Correspondences
![]() |
| Image correspondences after RANSAC, 84 points |
Dense point matching
Once an accurate model has been established, a guided matching can occur. Guided matching usually consists of rectifying the images and comparing them to each other. Image rectification and comparison creates a set of dense point correspondences thus a better end result. This process can be repeated to get the most out of the images. This has yet to be implemented however.
Alternate Method
![]() |
| Handpicked results, 37 correspondences |
Camera Calibration
Camera calibration was done with the Camera Calibration Toolbox for MATLAB. The camera calibration process consists of taking images of a known calibration object. For the toolbox the calibration object is a checkerboard pattern. The user holds the calibration object in different positions while taking pictures of the object. These images are fed through the toolbox and it outputs the calibration information or better known as intrinsic parameters.
Intrinsic Parameters
The intrinsic parameters consist of focal length, principal point, and skew coefficient. Additionally, the toolbox can compute the distortions of the camera. With this information, the camera matrix K is built. The camera matrix allows a metric reconstruction of the scene up to a scale. Additionally, knowing the lens distortions of the camera will allow removal of distortion in the image correspondences. These variables are:
| Name | Description | Variable | Dimensions |
| Focal length | Focal length in pixels | fc | 2x1 vector |
| Principal point | Principal point coordinates | cc | 2x1 vector |
| Skew coefficient | Skew coefficient defining the angle between the x and y pixel axes | alpha_c | 1x1 scalar |
| Distortion coefficients | Image distortion coefficients (radial and tangential distortions) | kc | 5x1 vector |
The calibration matrix is computed (p157):
$K = \[\left(\begin{array}{ccc} a_x & s & x_0 \\ 0 & a_y & y_0 \\ 0 & 0 & 1 \end{array}\right)\] = \[\left(\begin{array}{ccc} fc(1) & alpha_c * fc(1) & cc(1) \\ 0 & fc(2) & cc(2) \\ 0 & 0 & 1 \end{array}\right)\]$
Fundamental Matrix
The basic concept of the fundamental matrix is that it takes one space and projects onto another space. In this application the fundamental matrix takes the first image point set and projects it to the second image point set. The fundamental matrix has the essential matrix, which has the rotation and translation matrices that is needed to make the project matrices. It is computed by taking the Kronecker tensor product for every image correspondences and finding the SVD of the resulting matrix (p. 393)
Essential Matrix
In this case, $K_2$ and $K_1$ are equal because the same camera with the same setting were used. Thus, to recover the essential matrix [4, p. 257]:
$E = K_2^TFK_1=K^TFK$
The essential matrix must be projected into the essential space by:
$[U_E,E_E,V_E ] = SVD(E)$
$E = U_E * \[ \left( \begin{array}{ccc} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{array}\right)\] * V_E^T$
Extrinsic Parameters
The extrinsic parameters, rotation $R$ and translation $t$, can be recovered from the essential matrix $E$.
$W = \[ \left( \begin{array}{ccc} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{array}\right)\]$
$Z = \[\left(\begin{array}{ccc} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{array}\right)\]$
$[U_E,E_E,V_E ] = SVD(E)$
$R = UWV^T or R = UW^T V^T$
$t = U_E*[0 0 1]^T or t = -U_E*[0 0 1]^T$
$Z = \[\left(\begin{array}{ccc} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{array}\right)\]$
$[U_E,E_E,V_E ] = SVD(E)$
$R = UWV^T or R = UW^T V^T$
$t = U_E*[0 0 1]^T or t = -U_E*[0 0 1]^T$
Thus now there are four sets of possible rotation and translation pairs. (p. 258 - 259)
The correct rotation and translation is the pair that creates (i.e. triangulates) points in front of both cameras. First, the project matrices have to be computed (see Projection Matrices), this will give $P_1$ and $P_2$. Next, one image pair has to be triangulated (see Triangulation), creating a 3D point $X$. The depth with a positive value for both project matrices and the 3D point will indicate the correct rotation and translation pair. This is done by (p162):
The correct rotation and translation is the pair that creates (i.e. triangulates) points in front of both cameras. First, the project matrices have to be computed (see Projection Matrices), this will give $P_1$ and $P_2$. Next, one image pair has to be triangulated (see Triangulation), creating a 3D point $X$. The depth with a positive value for both project matrices and the 3D point will indicate the correct rotation and translation pair. This is done by (p162):
$X=(X,Y,Z,T)^T$
$P=[ M | p_4 ]$
$depth(X,P) = \frac{ sign(det(M)) * w }{T * ||m^3||}$
Projection Matrices
Computing the projection matrices is a simple task of putting all the information together. Note that the first projection matrix set to origin (0,0,0) with no rotation and translation.
$P_1 = \[\left(\begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array}\right)\]$
$P_2 = K * [R,t]$
$P_2 = K * [R,t]$
Triangulation
In a perfect scenario, two rays extending from the points would intersect at a point in space. However, because of slight errors in the point correspondences the two points would never intersect or would intersect at the wrong point in space. The optimal triangulation algorithm brings the points closer to their respective epipolar lines (the plane that intersects the two cameras and the 3d point) so they intersect at a more accurate point in space [4, p318]. The final step of the algorithm is the actual triangulation. The algorithm is performed for every correspondence between the two images creating a point cloud.
The final step is the creation of the 3D model. This can be approached in various ways. For a simple mesh creation, MATLAB’s Delaunay function proved to yield good results. Texture for each triangle face can be taken from corresponding three points on either image. The issue with method is that it won’t work if any of the three points lay on a different plane. The better method is to create a dense point cloud that has the color information for every vertex. For output, a good model format is OBJ as it has an easy to follow textual representation for the model and is supported in many different model editors.
Model Creation
![]() |
| Textured OBJ model viewed in MeshLab |






0 comments:
Post a Comment