Camera Motion Estimation Method using Depth-Normalized Criterion

Work Presented at Electronic Imaging 2024
Journal of Imaging Science and Technology, Volume 67, Article ID 060403
DOI: 10.2352/J.ImagingSci.Technol.2023.67.6.060403 | Published Online: November 2023
Abstract

For cameras fixed to translationally moving platforms, such as robots and cars, blurring can often be more pronounced in objects that are closer to the camera. A depth-normalized, least-squares objective function is proposed for the simultaneous recovery of shape and motion parameters from optical flow, together with an efficient iterative optimization algorithm. Simulations and experiments demonstrate that, for scenes with sufficient depth variation, our algorithm provides robust, statistically consistent estimates of shape and motion.

Cite this article

Seok Lee, "Camera Motion Estimation Method using Depth-Normalized Criterion," Journal of Imaging Science and Technology, 2023, pp. 1-6, https://doi.org/10.2352/J.ImagingSci.Technol.2023.67.6.060403
Copyright statement

Copyright © Society for Imaging Science and Technology 2023
Open access

Article timeline

  • Received: June 2023
  • Accepted: November 2023
  • Published: November 2023
1. Introduction
Motion and depth estimation using vision sensors, known classically as the structure from motion (SFM) problem, is an essential component of mobile robot localization, mapmaking, and navigation. While the SFM problem is well understood and generally considered solved, mobile robots are often equipped with off-the-shelf, low-performance sensors (particularly with the proliferation of low-cost mobile robots for the mass market) and operate in unstructured environments under uneven lighting conditions. The problem must therefore be addressed with statistical tools such as Kalman filters or Markov methods [1, 2], which keeps the estimation problem both relevant and challenging [3-7].
In the classical structure from motion (SFM) literature, it is now well recognized that noise in the image velocities, together with the presence of just a few outliers, can significantly degrade the estimates of depth and motion. Fermuller et al. [8] and Simoncelli et al. [9] have investigated the probabilistic and statistical characteristics of optical flow measurements, while Daniilidis and Spetsakis [10] offer a comprehensive framework addressing various sources of error in motion estimation (e.g., statistical bias, correlated noise, geometric instabilities).
One factor contributing to this noise sensitivity is that most existing SFM algorithms treat the entire set of optical flow measurements uniformly, regardless of the distance from the camera or of whether the translational or rotational component of the motion is dominant. In typical video scenes taken in urban settings, for example, it is quite common for objects to appear at a wide range of depths from the camera. In the case of translational motion, the magnitude of the optical flow is inversely proportional to depth, and the blurring caused by camera exposure is often more pronounced for objects that are closer to the camera, while the optical flow measurements of extremely distant points tend to be dominated by noise. It would thus seem reasonable to rely more heavily on the optical flow measurements of points sufficiently distant from the camera to minimize the effects of blurring, while still ensuring an appropriate signal-to-noise ratio. A closely related idea is that of Heeger [11], who formulated image flow uncertainty in such a way that it increases with flow magnitude.
This paper presents a depth-normalized criterion for simultaneously recovering velocity and depth information from optical flow data, together with an efficient iterative algorithm for its optimization. Intended for scenarios where near points are subject to greater blurring, our objective function normalizes the data such that the optical flow measurements from distant points are given proportionally greater weight. We present an efficient cyclic coordinate descent algorithm for obtaining the shape and motion estimates. Finally, extensive simulation and experimental studies are conducted to assess the performance of our algorithm, and results show that, for scenes with sufficient depth variation, our algorithm leads to more robust and accurate shape and motion estimators.
2. Problem Formulation
2.1 Camera Model & Measurements
We assume a standard perspective projection model for a camera with unit focal length. The image velocity induced by the camera motion is then
(1)
u(p)=λ(p)A(p)v+B(p)ω+n(p),
where u(p) = (ux(p), uy(p))T is the two-dimensional image velocity vector at image position p = (px, py, 1)T, v = (vx, vy, vz)T is the camera’s translational velocity, ω = (ωx, ωy, ωz)T is its angular velocity, and the scalar λ(p) is the inverse scene depth at image point p. The term n(p) = (nx(p), ny(p))T denotes noise, and
(2)
A(p) = \begin{pmatrix} 1 & 0 & -p_x \\ 0 & 1 & -p_y \end{pmatrix},
(3)
B(p) = \begin{pmatrix} p_x p_y & -(1 + p_x^2) & p_y \\ 1 + p_y^2 & -p_x p_y & -p_x \end{pmatrix}.
Given a collection of n optical flow measurements {(p1, u1), …, (pn, un)}, the objective is to estimate the translational and angular velocities v and ω, and the inverse depths λ1, …, λn associated with each of the image points p1, …, pn in some optimal fashion. It is well known that since λ(p) and v appear as a product in Eq. (1), their individual magnitudes cannot be determined; we therefore adopt the standard practice of assuming ∥v∥ = 1.
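For concreteness, the measurement model of Eqs. (1)-(3) can be written down in a few lines of NumPy. This is a minimal sketch rather than the paper's implementation; the helper names are ours, and the sign convention follows the standard instantaneous motion-field equations, matching Eqs. (2) and (3).

```python
import numpy as np

def A_mat(p):
    # A(p) of Eq. (2); p = (px, py, 1) in normalized image coordinates.
    return np.array([[1.0, 0.0, -p[0]],
                     [0.0, 1.0, -p[1]]])

def B_mat(p):
    # B(p) of Eq. (3).
    px, py = p[0], p[1]
    return np.array([[px * py, -(1.0 + px ** 2), py],
                     [1.0 + py ** 2, -px * py, -px]])

def image_velocity(p, lam, v, omega, noise_std=0.0):
    # u(p) = lambda(p) A(p) v + B(p) omega + n(p), as in Eq. (1).
    n = noise_std * np.random.randn(2)
    return lam * A_mat(p) @ v + B_mat(p) @ omega + n
```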
Zhang and Tomasi [12] have shown that nonisotropic noise models for optical flow can lead to statistically inconsistent motion parameter estimates, i.e., estimates that remain biased even in the infinite-sample limit; intuitively, the estimates fail to improve in accuracy with more optical flow measurements. Their study also highlights the sometimes fatal consequences of inappropriate transformations of the original SFM problem formulation, particularly those based on epipolar geometry. Epipolar methods have the advantage of decoupling the depth and motion estimation problems; by algebraically eliminating depth from the objective function via the epipolar constraint, the dimension of the ensuing optimization problem is significantly reduced. The depth parameters can moreover be recovered by a simple postprocessing procedure involving a singular value decomposition. One study [12] emphasized that the motion-depth decoupling achieved in the various epipolar methods is due to transformations of the fundamental SFM problem, and cited several examples of popular epipolar geometry-based SFM estimators that fail to be statistically consistent, e.g., [13-15].
Zhang and Tomasi [12] further show that under the assumption that the errors of the optical flow measurements are independent, identically distributed, and isotropic (in the sense of being rotationally symmetric), the estimator given by
(4)
\arg\min_{\omega, v} \sum_{i=1}^{n} \inf_{\lambda_i} \left\| \lambda_i A(p_i)\, v + B(p_i)\, \omega - u_i \right\|^q,
where ω ∈ ℝ³, v ∈ S² (∥v∥ = 1), q ≥ 1, and ∥⋅∥ denotes the Euclidean two-norm, is statistically consistent (their objective function is presented in a slightly more general form than the one given here). An efficient iterative Gauss-Newton algorithm is also derived.
2.2 Depth-Normalized Objective Function
Figure 1 illustrates the image blurring that can occur in typical dynamic urban scenes; this image was taken from a moving car at a 1/50 s shutter speed. The direction of camera movement is perpendicular to the optical axis, and the magnitude of the motion field is inversely proportional to the depth of the scene point. During camera exposure, the image is blurred by the motion field induced by the camera translation. Captured images always contain some degree of this motion blur because the camera exposure time is finite and greater than zero. For a mobile robot undergoing linear translation, the accuracy of visual navigation is affected by this per-frame motion blur, because consecutive camera frames are used to estimate motion and depth.
Figure 1.
Blurring and optical flow noise with respect to camera depth value.
To compensate for this particular type of blurring phenomenon, we propose a modified version of the objective function (4) that weights the optical flow measurements according to depth:
(5)
J(\omega, v, \lambda) = \sum_{i=1}^{n} \frac{1}{\lambda_i^2} \left\| \lambda_i A(p_i)\, v + B(p_i)\, \omega - u_i \right\|^2.
Informally, the inverse depth scaling has the effect of “undoing” the perspective projection before considering the noise. To ensure that the flow measurements are of sufficient signal-to-noise ratio, in practical implementations, one would discard measurements that are beyond a certain threshold depth; these and other practical issues are discussed in detail later.
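As a sketch, the depth-normalized objective of Eq. (5) can be evaluated directly with the hypothetical helpers defined above; points, flows, and lam are assumed to hold the p_i, u_i, and λ_i.

```python
def J_depth_normalized(omega, v, lam, points, flows):
    # Eq. (5): sum_i (1 / lam_i^2) || lam_i A(p_i) v + B(p_i) omega - u_i ||^2.
    total = 0.0
    for p, u, l in zip(points, flows, lam):
        r = l * A_mat(p) @ v + B_mat(p) @ omega - u
        total += (r @ r) / l ** 2
    return total
```

Each residual is divided by λ_i², so distant points (small inverse depth λ_i) contribute with proportionally greater weight, as intended.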
3. Solution
As is common in the SFM literature, we focus on the case v ≠ 0, because the v = 0 case can be easily detected and treated separately. The optimal λ can be determined parametrically as a function of ω and v from the first-order necessary conditions for optimality, i.e., by setting the gradient equal to zero. This leads to
(6)
\lambda_k = \frac{\| u_k - B(p_k)\, \omega \|^2}{(u_k - B(p_k)\, \omega)^T A(p_k)\, v}.
Given values for ω and v, the λ that minimizes the cost function (5) is given by the above. By substituting λ(ω, v) above back into (5), the cost function becomes, after some manipulation,
(7)
J(\omega, v) = \sum_{i=1}^{n} \left\| \left( I - \frac{(u_i - B(p_i)\, \omega)(u_i - B(p_i)\, \omega)^T}{\| u_i - B(p_i)\, \omega \|^2} \right) A(p_i)\, v \right\|^2.
We use the following notation:
Q_i(\omega) = A(p_i) - \frac{(u_i - B(p_i)\, \omega)(u_i - B(p_i)\, \omega)^T}{\| u_i - B(p_i)\, \omega \|^2}\, A(p_i), \qquad Q(\omega) = \begin{pmatrix} Q_1(\omega) \\ \vdots \\ Q_n(\omega) \end{pmatrix}.
Note that Q_i(ω) ∈ ℝ^{2×3} and Q(ω) ∈ ℝ^{2n×3}. The objective function can now be written as
(8)
J(\omega, v) = \| Q(\omega)\, v \|^2.
If ω is given, one can show using a Lagrange multiplier argument that the optimal v is the unit-length eigenvector of Q^T Q corresponding to the smallest eigenvalue. Note that the cost function is symmetric with respect to v, in the sense that both v and −v lead to identical values of the objective function.
Instead of attempting to simultaneously minimize the cost function with respect to ω and v, we minimize sequentially over the two parameters as follows (a code sketch is given after the convergence discussion below):
1. Let k = 0 and choose any initial value ω_k ∈ ℝ³.
2. Iterate the following:
   • v_k = unit-length eigenvector of Q(ω_k)^T Q(ω_k) corresponding to the minimal eigenvalue;
   • ω_{k+1} = argmin_{ω ∈ ℝ³} J(ω, v_k), where v_k is obtained from the previous step;
   • k = k + 1.
Under various compactness and uniqueness assumptions one can show via the global convergence theorem (see [16, 17]) that the above cyclic coordinate descent (CCD) algorithm is assured of converging to meaningful local minima. We do not address the details here but refer the reader to [18, 19] and the previous references for applications of the global convergence theorem in vision settings, and a discussion of the subtleties.
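The following is a minimal sketch of the CCD loop in NumPy/SciPy, using the hypothetical helpers A_mat and B_mat defined earlier; the ω-step is delegated to a general-purpose BFGS solver rather than a specialized fractional-programming routine, and the initialization and iteration cap are our own arbitrary choices.

```python
from scipy.optimize import minimize

def Q_of_omega(omega, points, flows):
    # Stack Q_i(omega) = (I - b_i b_i^T / ||b_i||^2) A(p_i) into a 2n x 3 matrix.
    rows = []
    for p, u in zip(points, flows):
        b = u - B_mat(p) @ omega
        P = np.outer(b, b) / (b @ b)
        rows.append((np.eye(2) - P) @ A_mat(p))
    return np.vstack(rows)

def ccd(points, flows, omega0, max_iters=50, eps=1e-6):
    omega, J_prev = np.asarray(omega0, dtype=float), np.inf
    for _ in range(max_iters):
        # v-step: unit eigenvector of Q^T Q for the smallest eigenvalue.
        Q = Q_of_omega(omega, points, flows)
        _, V = np.linalg.eigh(Q.T @ Q)              # eigenvalues ascending
        v = V[:, 0]
        # omega-step: minimize J(omega, v) = ||Q(omega) v||^2 over omega.
        J_cond = lambda om: np.sum((Q_of_omega(om, points, flows) @ v) ** 2)
        omega = minimize(J_cond, omega, method="BFGS").x
        J = J_cond(omega)
        if abs(J_prev - J) / J_prev < eps:           # stopping rule of Eq. (14)
            break
        J_prev = J
    # Depth recovery via Eq. (6).
    lam = []
    for p, u in zip(points, flows):
        b = u - B_mat(p) @ omega
        lam.append((b @ b) / (b @ A_mat(p) @ v))
    return omega, v, np.array(lam)
```

Note that the sign of the eigenvector returned in the v-step is arbitrary; as observed above, v and −v yield the same objective value, so the sign must be fixed separately, for instance by requiring the recovered depths to be positive.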
We now examine in more detail the conditional problem of minimizing J(ω, v) given v ∈ S²; we denote this conditional objective function by J(ω|v). Defining
(9)
b_i(\omega) = u_i - B(p_i)\, \omega,
J(ω|v) can be written after some manipulation as
J(\omega \mid v) = \sum_{i=1}^{n} \left( \| A(p_i)\, v \|^2 - \frac{b_i^T A(p_i)\, v\, v^T A^T(p_i)\, b_i}{\| b_i \|^2} \right).
Ignoring the ∥A(p_i)v∥² term (since v is assumed given), and defining
(10)
Ri(v)=A(pi)vvTAT(pi),
we have the following sum-of-ratios quadratic fractional programming problem (minimizing the negated sum is equivalent to maximizing the sum of ratios):
(11)
\min_{\omega \in \mathbb{R}^3} J(\omega \mid v) = -\sum_{i=1}^{n} \frac{b_i^T R_i\, b_i}{b_i^T b_i}.
Each Ri is symmetric, positive semidefinite, and of rank one. The analytic gradient of J(ω|v) is useful for numerical optimization purposes:
(12)
\frac{\partial J(\omega \mid v)}{\partial \omega} = 2 \sum_{i=1}^{n} B^T(p_i) \left( \frac{R_i\, b_i}{\| b_i \|^2} - \frac{b_i^T R_i\, b_i}{\| b_i \|^4}\, b_i \right).
With this gradient, any number of standard optimization algorithms and specialized algorithms for fractional programming are at our disposal [20].
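The following is a direct transcription of the reconstructed gradient in Eq. (12), usable with any gradient-based solver in place of the derivative-free ω-step sketched earlier; the helper names are again ours.

```python
def grad_J_cond(omega, v, points, flows):
    # Gradient of J(omega | v) = -sum_i (b_i^T R_i b_i) / (b_i^T b_i),
    # with b_i = u_i - B(p_i) omega and R_i = A(p_i) v v^T A^T(p_i).
    g = np.zeros(3)
    for p, u in zip(points, flows):
        A, B = A_mat(p), B_mat(p)
        b = u - B @ omega
        Av = A @ v
        R = np.outer(Av, Av)        # R_i(v) of Eq. (10): symmetric, PSD, rank one
        nb2 = b @ b
        g += 2.0 * B.T @ (R @ b / nb2 - (b @ R @ b) / nb2 ** 2 * b)
    return g
```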
4. Experimental Results
4.1 Synthetic Data
Experiments with synthetic data have been performed with our proposed algorithm, and the results are compared with the algorithms of Zhang and Tomasi [12] and Soatto and Brockett [21]; the latter developed a cyclic descent optimization algorithm for a standard epipolar geometry-based motion estimation criterion. 50 feature points are randomly generated from a uniform distribution in a three-dimensional 120 × 120 × 120 region. These points are assumed to belong to a single rigid body moving with translational velocity (1, 3, 2) and angular velocity (−1, 0.5, 1.5). Corresponding optical flow measurements are obtained via perspective projection. Independent uncorrelated Gaussian noise is then added to the measurements after scaling the noise by depth. In our simulations noise levels are successively increased up to 50% of the average optical flow magnitudes.
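A sketch of this synthetic setup follows; the placement of the 120 × 120 × 120 region in front of the camera and the exact direction of the depth scaling of the noise are our assumptions, chosen to match the premise that nearer points are noisier.

```python
rng = np.random.default_rng(0)
pts3d = rng.uniform([-60.0, -60.0, 20.0], [60.0, 60.0, 140.0], size=(50, 3))
v_true = np.array([1.0, 3.0, 2.0])
v_true /= np.linalg.norm(v_true)        # only the direction of v is observable
w_true = np.array([-1.0, 0.5, 1.5])
points, flows = [], []
for X in pts3d:
    p = np.array([X[0] / X[2], X[1] / X[2], 1.0])   # perspective projection
    lam = 1.0 / X[2]                                # inverse depth
    u = lam * A_mat(p) @ v_true + B_mat(p) @ w_true
    u += 0.1 * lam * rng.standard_normal(2)         # noise grows for nearer points
    points.append(p)
    flows.append(u)
```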
Spherical velocity errors are measured according to
(13)
d(v_{\mathrm{act}}, v_{\mathrm{est}}) = \cos^{-1} \langle v_{\mathrm{act}}, v_{\mathrm{est}} \rangle,
where v_act ∈ S² denotes the actual velocity, and v_est ∈ S² denotes the estimated value obtained from the optimization. Physically, this metric corresponds to the angle between v_act and v_est; that this definition satisfies the distance metric axioms can be straightforwardly verified. Linear velocity errors are measured in the standard way in terms of the Euclidean two-norm. In the optimization procedure we use the stopping criterion
(14)
\frac{| J_{k+1}(\omega, v) - J_k(\omega, v) |}{J_k(\omega, v)} < \epsilon,
where ϵ is on the order of 10⁻⁶.
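A sketch of the error metric of Eq. (13); the absolute value folds in the v/−v sign ambiguity noted in Section 3, which Eq. (13) itself leaves implicit.

```python
def spherical_error(v_act, v_est):
    # Angle between the actual and estimated direction vectors, Eq. (13).
    c = abs(float(v_act @ v_est))       # |<v_act, v_est>|, folding the v / -v ambiguity
    return np.arccos(np.clip(c, 0.0, 1.0))
```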
We first examined whether increasing the number of feature points increases the accuracy of the linear and angular velocity estimates produced by our depth-normalized criterion. Adding depth-scaled zero-mean Gaussian noise with 0.1 standard deviation to the optical flow measurements, we examined both the error and standard deviation of the linear and angular velocity estimates as a function of the number of feature points. The feature points are increased from 1,000 to 10,000 in increments of 1,000. Figure 2 shows the results of our algorithm for synthetic data, averaged over 50 sample trials; the top graph illustrates the estimation error bias, while the bottom shows the standard deviation, both as a function of the number of feature points. Our results are also compared with those obtained using the Zhang-Tomasi (ZT) and Soatto-Brockett (SB) algorithms. All three algorithms display a distinct trend of decreasing bias and variance with increasing number of feature points; however, our depth-normalized criterion shows the most rapid decrease.
Figure 2.
(a) Estimation error bias and (b) standard deviation versus number of feature points.
We then examined the noise sensitivity of our depth-normalized algorithm. 30 feature points are used, and noise levels are successively increased up to 50% of the average optical flow magnitudes (corresponding to absolute values of around 0.2). Figure 3 illustrates the linear and angular velocity estimation errors and standard deviation as a function of increasing noise levels. The errors are again obtained as the average of 50 trials, with the ranges indicating plus-minus one standard deviation. The errors can be seen to increase in approximately linear fashion as noise levels are increased.
Figure 3.
(a) Linear and (b) angular velocity errors as a function of noise level.
Table I.
Computation times for the three algorithms.
              Proposed   ZT     SB
Time (s)      0.73       0.70   0.49
Iterations    7.2        5.8    4.8
Table I lists the average computation times for the three algorithms. All algorithms were implemented in Microsoft Visual C++. Feature point detection and optical flow calculation were performed using the appropriate OpenCV routines, and the nonlinear optimization routines from the PC version of the IMSL library were used for numerical optimization, in conjunction with an internally developed matrix computation library (RMatrix).
Not surprisingly, our proposed algorithm was the slowest, followed closely by the ZT algorithm; since both of these algorithms explicitly solve for the depth parameters within the optimization, this result is expected. The SB algorithm, which eliminates the depth parameters altogether via the epipolar constraint, was the fastest of the three.
4.2 Real Images
We then evaluated our algorithm on a series of scenes captured with a Point Grey Flea camera mounted on a Pioneer PeopleBot; the camera's exposure time and iris were varied so as to produce a range of blurring effects. The scene depicted in Figures 4 and 5 contains objects at depths of up to 30 m. We deliberately captured the motion sequences at slow shutter speeds to accentuate the blurring effect and raise noise levels. Optical flow measurements were obtained using the OpenCV pyramidal implementation of the iterative Lucas-Kanade method; the feature points were likewise extracted using the OpenCV library.
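The measurement pipeline can be reproduced with standard OpenCV calls; the sketch below uses Shi-Tomasi corners and the pyramidal Lucas-Kanade tracker, with file names and tracker parameters that are our own choices rather than the authors'.

```python
import cv2

prev = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)   # hypothetical file names
curr = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
pts0 = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01, minDistance=8)
pts1, status, _ = cv2.calcOpticalFlowPyrLK(prev, curr, pts0, None,
                                           winSize=(21, 21), maxLevel=3)
ok = status.ravel() == 1
flow = (pts1[ok] - pts0[ok]).reshape(-1, 2)   # per-feature image velocity (pixels/frame)
```

The pixel coordinates and velocities must still be converted to the normalized, unit-focal-length coordinates of Eq. (1) using the camera calibration before being passed to the estimator.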
Figure 4.
Experimental results for scene 1.
Figure 5.
Experimental results for scene 2.
The camera underwent a linear translation directly toward the objects as the robot moved forward; the resulting optical flow measurements are shown in the upper-right figure. One can observe the relatively large number of incorrect optical flow vectors for the proximal object; its errors are noticeably more pronounced than those of the distal objects. Comparing the optical flow field estimated using our proposed algorithm with that obtained from the ZT algorithm, our algorithm shows better performance: the directional errors present in the measured flow field are largely corrected by our depth-normalized criterion.
5. Conclusion
One factor contributing to the noise sensitivity of existing SFM algorithms is that the optical flow measurements of all points, regardless of their depth, are treated with the same degree of fidelity. This paper has proposed a depth-normalized criterion that places greater weight on optical flow measurements at increased depths. The underlying premise is that for mobile robots and cars with fixed cameras traveling linearly through typical scenes, particularly in urban environments, blurring (and thus more noise) is often more pronounced in objects that are closer to the camera. We derived an efficient cyclic optimization algorithm for estimating the velocity and depth parameters. Experiments with both synthetic data and real images suggest that, for scenes with sufficient depth variation in which translational motions are dominant, our depth-normalized criterion leads to improved estimates of velocity and depth. The proposed method is not superior in terms of computational efficiency, because the depth values are estimated within the optimization process, whereas epipolar-based algorithms eliminate them and recover depth in a separate postprocessing step. Future work should explore more computationally efficient motion estimation algorithms based on the proposed depth-normalization criterion. We are also working to improve the current implementation of the proposed method, including removing dependencies on internally developed libraries.
Acknowledgment
This paper was supported by Education and Research promotion program of KOREATECH in 2022.
References
1. M. Bashiri, H. Vatankhah, and S. S. Ghidary, "Hybrid adaptive differential evolution for mobile robot localization," J. Intel. Serv. Robotics 5, 99–107 (2011). DOI: 10.1007/s11370-012-0106-2
2. D. R. Parhi and S. Kundu, "Navigation control of underwater robot using dynamic differential evolution approach," Proc. IMechE Part M: J. Engineering for the Maritime Environment 231, 284–301 (2017).
3. J. Kim, C. Park, and I. S. Kweon, "Vision-based navigation with efficient scene recognition," J. Intel. Serv. Robotics 4, 191–202 (2011). DOI: 10.1007/s11370-011-0091-x
4. L. Niu, S. Smirnov, J. Mattila, A. Gotchev, and E. Ruiz, "Robust pose estimation with a stereoscopic camera in harsh environments," Proc. IS&T Electronic Imaging: Intelligent Robotics and Industrial Applications using Computer Vision 2018 (IS&T, Springfield, 2018), pp. 126-1–126-6. DOI: 10.2352/ISSN.2470-1173.2018.09.IRIACV-126
5. M. B. Alatise and G. P. Hancke, "A review on challenges of autonomous mobile robot and sensor fusion methods," IEEE Access 8, 39830–39846 (2020). DOI: 10.1109/ACCESS.2020.2975643
6. X. Zhang, W. Wang, X. Qi, Z. Liao, and R. Wei, "Point-plane SLAM using supposed planes for indoor environments," Sensors 19, 3795 (2019). DOI: 10.3390/s19173795
7. J. A. Placed and J. A. Castellanos, "A deep reinforcement learning approach for active SLAM," Appl. Sci. 10, 8386 (2020). DOI: 10.3390/app10238386
8. C. Fermuller, D. Shulman, and R. Pless, "The statistics of optical flow," Comput. Vis. Image Underst. 82, 1–32 (2001). DOI: 10.1006/cviu.2000.0900
9. E. P. Simoncelli, E. H. Adelson, and D. J. Heeger, "Probability distributions of optical flow," IEEE Conf. on Computer Vision and Pattern Recognition (IEEE, Piscataway, NJ, 1991). DOI: 10.1109/CVPR.1991.139707
10. K. Daniilidis and M. Spetsakis, "Understanding noise sensitivity in structure from motion," in Visual Navigation, ed. Y. Aloimonos (Psychology Press, East Sussex, 1996), pp. 61–88.
11. D. J. Heeger, "Optical flow using spatiotemporal filters," Int. J. Comput. Vis. 1, 279–302 (1988). DOI: 10.1007/BF00133568
12. T. Zhang and C. Tomasi, "On the consistency of instantaneous rigid motion estimation," Int. J. Comput. Vis. 46, 51–79 (2002). DOI: 10.1023/A:1013248231976
13. A. Bruss and B. Horn, "Passive navigation," Comput. Graph. Image Process. 21, 3–20 (1983). DOI: 10.1016/S0734-189X(83)80026-7
14. X. Zhuang, T. Huang, N. Ahuja, and R. Haralick, "A simplified linear optic flow-motion algorithm," Comput. Vis. Graph. Image Process. 42, 334–344 (1988). DOI: 10.1016/S0734-189X(88)80043-4
15. A. D. Jepson and D. J. Heeger, "Linear subspace methods for recovering translation direction," in Spatial Vision in Humans and Robots, eds. L. Harris and M. Jenkins (Cambridge University Press, Cambridge, 1993), pp. 39–62.
16. W. Zangwill, Nonlinear Programming: A Unified Approach (Prentice-Hall, Englewood Cliffs, 1969).
17. D. G. Luenberger, Linear and Nonlinear Programming (Addison Wesley, Boston, 1989).
18. S. Mahamud, M. Hebert, Y. Omori, K. McHenry, and J. Ponce, "Provably-convergent iterative methods for projective structure from motion," IEEE Int'l. Conf. Computer Vision & Pattern Recognition (IEEE, Piscataway, NJ, 2001), pp. 1018–1025. DOI: 10.1109/CVPR.2001.990642
19. S. Gwak, J. Kim, and F. C. Park, "Numerical optimization on the Euclidean group with applications to camera calibration," IEEE Trans. Robot. Autom. 19, 65–74 (2003). DOI: 10.1109/TRA.2002.807530
20. S. Schaible and J. Shi, "Fractional programming: the sum-of-ratios case," Optimization Methods Softw. 18, 219–229 (2003). DOI: 10.1080/1055678031000105242
21. S. Soatto and R. Brockett, "Optimal structure from motion: Local ambiguities and global estimates," Int. J. Comput. Vis. 39, 195–228 (2000). DOI: 10.1023/A:1026563712076