A General Optimization-based Framework for Local Odometry
Estimation with Multiple Sensors

arXiv:1901.03638v1 [cs.CV] 11 Jan 2019

Tong Qin, Jie Pan, Shaozu Cao, and Shaojie Shen
Abstract— Nowadays, more and more sensors are equipped
on robots to increase robustness and autonomous ability. We
have seen various sensor suites equipped on different platforms,
such as stereo cameras on ground vehicles, a monocular camera
with an IMU (Inertial Measurement Unit) on mobile phones,
and stereo cameras with an IMU on aerial robots. Although
many algorithms for state estimation have been proposed in the
past, they are usually applied to a single sensor or a specific
sensor suite. Few of them can be employed with multiple sensor
choices. In this paper, we proposed a general optimization-based
framework for odometry estimation, which supports multiple
sensor sets. Every sensor is treated as a general factor in our
framework. Factors which share common state variables are
summed together to build the optimization problem. We further
demonstrate the generality with visual and inertial sensors,
which form three sensor suites (stereo cameras, a monocular
camera with an IMU, and stereo cameras with an IMU). We
validate the performance of our system on public datasets and
through real-world experiments with multiple sensors. Results
are compared against other state-of-the-art algorithms. We
highlight that our system is a general framework, which can
easily fuse various sensors in a pose graph optimization. Our
implementations are open source1 .

I. I NTRODUCTION
Real-time 6-DoF (Degrees of Freedom) state estimation
is a fundamental technology for robotics. Accurate state
estimation plays an important role in various intelligent
applications, such as robot exploration, autonomous driving,
VR (Virtual Reality) and AR (Augmented Reality). The most
common sensors we use in these applications are cameras. A
large number of impressive vision-based algorithms for pose
estimation has been proposed over the last decades, such as
[1]–[5]. Besides cameras, the IMU is another popular option
for state estimation. The IMU can measure acceleration and
angular velocity at a high frequency, which is necessary for
low-latency pose feedback in real-time applications. Hence,
there are numerous research works fusing vision and IMU
together, such as [6]–[12]. Another popular sensor used in
state estimation is LiDAR. LiDAR-based approaches [13]
achieve accurate pose estimation in a confined local environment. Although a lot of algorithms have been proposed
in the past, they are usually applied to a single input sensor
or a specific sensor suite.
Recently, we have seen platforms equipped with various
sensor sets, such as stereo cameras on ground vehicles, a
monocular camera with an IMU on mobile phones, stereo
All authors are with the Department of Electronic and
Computer Engineering, Hong Kong University of Science and
Technology, Hong Kong, China. {tong.qin, jie.pan,
shaozu.cao}@connect.ust.hk, eeshaojie@ust.hk.
1 https://github.com/HKUST-Aerial-Robotics/VINS-Fusion

Fig. 1. An illustration of the proposed framework for state estimation,
which supports multiple sensor choices, such as stereo cameras, a monocular
camera with an IMU, and stereo cameras with an IMU. Each sensor is
treated as a general factor. Factors which share common state variables are
summed together to build the optimization problem.

cameras with an IMU on aerial robots. However, as most
traditional algorithms were designed for a single sensor or
a specific sensor set, they cannot be ported to different
platforms. Even for one platform, we need to choose different sensor combinations in different scenarios. Therefore,
a general algorithm which supports different sensor suites
is required. Another practical requirement is that in case
of sensor failure, an inactive sensor should be removed
and an alternative sensor should be added into the system
quickly. Hence, a general algorithm which is compatible with
multiple sensors is in need.
In this paper, we propose a general optimization-based
framework for pose estimation, which supports multiple
sensor combinations. We further demonstrate it with visual
and inertial sensors, which form three sensor suites (stereo
cameras, a monocular camera with an IMU, and stereo cameras with an IMU). We can easily switch between different
sensor combinations. We highlight the contribution of this
paper as follows:
a general optimization-based framework for state estimation, which supports multiple sensors.
• a detailed demonstration of state estimation with visual
and inertial sensors, which form different sensor suites
(stereo cameras, a monocular camera + an IMU, and
stereo cameras + an IMU).
• an evaluation of the proposed system on both public
datasets and real experiments.
•

•

open-source code for the community.
II. R ELATED W ORK

State estimation has been a popular research topic over the
last decades. A large number of algorithms focus on accurate
6-DoF pose estimation. We have seen many impressive
approaches that work with one kind of sensor, such as
visual-based methods [1]–[5], LiDAR-based methods [13],
RGB-D based methods [14]. and event-based methods [15].
Approaches work with a monocular camera is hard to achieve
6-DoF pose estimation, since absolute scale cannot be recovered from a single camera. To increase the observability
and robustness, multiple sensors which have complementary
properties are fused together.
There are two trends of approaches for multi-sensor fusion. One is filter-based methods, the other is optimizationbased methods. Filter-based methods are usually achieved by
EKF (Extended Kalman Filter). Visual and inertial measurements are usually filtered together for 6-DoF state estimation.
A high-rate inertial sensor is used for state propagation and
visual measurements are used for the update in [9, 16].
MSCKF [6, 7] was a popular EKF-based VIO (Visual Inertial
Odometry), which maintained several camera poses and
leveraged multiple camera views to form the multi-constraint
update. Filter-based methods usually linearize states earlier
and suffer from error induced by inaccurate linear points.
To overcome the inconsistency caused by linearized error,
observability constrained EKF [17] was proposed to improve
accuracy and consistency. An UKF (Unscented Kalman
Filter) algorithm was proposed in [18], where visual, LiDAR and GPS measurements were fused together. UKF is
an externsion of EKF without analytic Jacobians. Filterbased methods are sensitive to time synchronization. Any
late-coming measurements will cause trouble since states
cannot be propagated back in filter procedure. Hence, special ordering mechanism is required to make sure that all
measurements from multiple sensors are in order.
Optimization-based methods maintain a lot of measurements and optimize multiple variables at once, which is also
known as Bundle Adjustment (BA). Compared with filterbased method, optimization-based method have advantage
in time synchronization. Because the big bundle serves as a
nature buffer, it can easily handle the case when measurments
from multiple sensors come in disorder. Optimization-based
algorithms also outperform the filter-based algorithms in
term of accuracy at the cost of computational complexity.
Early optimization solvers, such as G2O [19], leveraged
the Gauss-Newton and Levenberg-Marquardt approaches
to solve the problem. Although the sparse structure was
employed in optimization solvers, the complexity grown
quadratically with the number of states and measurements.
In order to achieve real-time performance, some algorithms
have explored incremental solvers, while others bounded
the size of the pose graph. iSAM2 [20] was an efficient
incremental solver, which reused the previous optimization
result to reduce computation when new measurements came.
The optimization iteration only updated a small part of states

Fig. 2.
A graphic illustration of the pose graph. Each node represents
states (position, orientation, velocity and so on) at one moment. Each edge
represents a factor, which is derived by one measurement. Edges constrain
one state, two states or multiple states.

instead of the whole pose graph. Afterward, an accelerated
solver was proposed in [21], which improved efficiency by
reconstructing dense structure into sparse blocks. Methods,
that keep a fixed sized of pose graph, are called slidingwindow approaches. Impressive optimization-based VIO approaches, such as [8, 10, 12], optimized variables over
a bounded-size sliding window. The previous states were
marginalized into a prior factor without loss of information in [8, 12]. In this paper, we adopt a sliding-window
optimization-based framework for state estimation.
III. S YSTEM OVERVIEW
The structure of proposed framework is shown in Fig. 1.
Multiple kinds of sensors can be freely combined. The
mesurement of each sensor is treated as a general factor.
Factors and their related states form the pose graph. An
illustration of pose graph is shown in Fig. 2. Each node
represents states (position, orientation, velocity and so on)
at one moment. Each edge represents a factor, which is
derived by one measurement. Factors constrain one state,
two states or multiple states. For IMU factor, it constrains
two consecutive states by continuous motion restriction. For
a visual landmark, its factor constrains multiple states since
it is observed on multiple frames. Once the graph is built,
optimizing it equals to finding the configuration of nodes that
match all edges as much as possible.
In this paper, we specifically demonstrate the system with
visual and inertial sensors. Visual and inertial sensors can
form three combinations for 6-DoF state estimation, which
are stereo cameras, a monocular camera with an IMU, and
stereo cameras with an IMU. A graphic illustration of the
proposed framework with visual and inertial sensors is shown
in Fig. 3. Several camera poses, IMU measurements and
visual measurements exist in the pose graph. The IMU and
one of cameras are optional.
IV. M ETHODOLOGY
A. Problem Definition
1) States: Main states that we need to estimate includes
3D position and orientation of robot’s center. In addition, we

Fig. 3. A graphic illustration of the proposed framework with visual and inertial sensors. The IMU and one of cameras are optional. Therefore, it forms
three types (stereo cameras, a monocular camera with an IMU, and stereo cameras with an IMU). Several camera poses, IMU measurements and visual
measurements exist in the pose graph.

have other optional states, which are related to sensors. For
cameras, depths or 3D locations of visual landmarks need to
be estimated. For IMU, it produces another motion variable,
velocity, Also, time-variant acceleration bias and gyroscope
bias of the IMU are needed to be estimated. Hence, for visual
and inertial sensors, whole states we need to estimate are
defined as follows:
X = [p0 , R0 , p1 , R1 , ..., pn , Rn , xcam , ximu ]
xcam = [λ0 , λ1 , ..., λl ]

(1)

ximu = [v0 , ba0 , bg0 , v1 , ba1 , bg1 , ..., vn , ban , bgn ],
where p and R are basic system states, which correspond to
position and orientation of body expressed in world frame.
xcam is camera-ralated state, which includes depth λ of each
feature observed in the first frame. ximu is IMU-ralated
variable, which is composed of velocity v, acceleration bias
ba and gyroscope bias bg . ximu can be omitted if we only
use stereo camera without an IMU. The translation from
sensors’ center to body’s center are assumed to be known,
which are calibrated offline. In order to simplify the notation,
we denote the IMU as body’s center (If the IMU is not used,
we denote left camera as body’s center).
2) Cost Function: The nature of state estimation is an
MLE (Maximum Likelihood Estimation) problem. The MLE
consists of the joint probability distribution of robot poses
over a period of time. Under the assumption that all measurements are independent, the problem is typically derived
as,
n Y
Y
∗
X = arg max
p(zkt |X ),
(2)
X

t=0 k∈S

where S is the set of measurements, which come from
cameras, IMU and other sensors. We assume the uncertainty of measurements is Gaussian distributed, p(zkt |X ) ∼
N (z̄kt , Ωkt ). Therefore, the negative log-likelihood of abovementioned equation is written as,
X ∗ = arg max
X

= arg min
X

n Y
Y
t=0 k∈S

n X
X
t=0 k∈S

exp(−

1 k
2
z − hkt (X ) Ωk )
t
2 t
(3)

2
zkt − hkt (X ) Ωk .
t

2

The Mahalanobis norm is defined as krkΩ = rT Ω−1 r.
h(·) is the sensor model, which is detailed in the following
section. Then the state estimation is converted to a nonlinear
least squares problem, which is also known as Bundle
Adjustment (BA).
B. Sensor Factors
1) Camera Factor: The framework supports both monocular and stereo cameras. The intrinsic parameters of every
camera and the extrinsic transformation between cameras are
supposed to be known, which can be easily calibrated offline.
For each camera frame, corner features [22] are detected.
These features are tracked in previous frame by KLT tracker
[23]. For the stereo setting, the tracker also matches features
between the left image and right image. According to the
feature associations, we construct the camera factor with per
feature in each frame. The camera factor is the reprojection
process, which projects a feature from its first observation
into following frames.
Considering the feature l that is first observed in the image
i, the residual for the observation in the following image t
is defined as:
zlt − hlt (X ) = zlt − hlt (Ri , pi , Rt , pt , λl )
 l
 l
u
u
−1
−1
−1
= lt − πc (Tbc Tt Ti Tbc πc (λl , li )),
vt
vi
(4)
where [uli , vil ] is the first observation of the l feature that
appears in the i image. [ult , vtl ] is the observation of the same
feature in the t image. πc and πc−1 are the projection and
back-projection functions which depend on camera model
(pinhole, omnidirectional or other models).
 T is the 4x4
Rp
homogeneous transformation, which is
. We omit
01
some homogeneous terms for concise expression. Tcb is the
extrinsic transformation from body center to camera center,
which is calibrated offline. The covariance matrix Ωlt of
reprojection error is a constant value in pixel coordinate,
which comes from the camera’s intrinsic calibration results.
This factor is universal for both left camera and right
camera. We can project a feature from the left image to
the left image in temporal space, also we can project a
feature from the left image to the right image in spatial space.

For different cameras, a different extrinsic transformation Tcb
should be used.
2) IMU Factor: We use the well-known IMU preintegration algorithm [11, 12] to construct the IMU factor. We
assume that the additive noise in acceleration and gyroscope
measurements are Gaussian white noise. The time-varying
acceleration and gyroscope bias are modeled as a random
walk process, whose derivative is Gaussian white noise.
Since the IMU acquires data at a higher frequency than other
sensors, there are usually multiple IMU measurements existing between two frames. Therefore, we pre-integrate IMU
measurements on the manifold with covariance propagation.
The detailed preintegration can be found at [12]. Within two
time instants, t−1 and t, the preintegration produces relative
position αt−1
, velocity β t−1
and rotation γ t−1
. Also, the
t
t
t
preintegration propagates the covariance of relative position,
velocity, and rotation, as well as the covariance of bias. The
IMU residual can be defined as:
zimu
− himu
(X ) =
t
t

 t−1  
−1
αt
Rt−1 (pt − pt−1 + 12 gdt2 − vt−1 dt)
−1

 β t−1  
Rt−1 (vt − vt−1 + gdt)
t

 
 t−1
−1
,
γ  
R
R
t−1
t

 t  

 0  
bat − bat−1
b
−
b
0
gt
gt−1
(5)
where
is the minus operation on manifold, which is
specially used for non-linear rotation. dt is the time interval
between two time instants. g is the known gravity vector,
whose norm is around 9.81. Every two adjacent frames
construct one IMU factor in the cost function.
3) Other Factors: Though we only specify camera and
IMU factors, our system is not limited to these two sensors.
Other sensors, such as wheel speedometer, LiDAR and
Radar, can be added into our system without much effort.
The key is to model these measurements as general residual
factors and add these residual factors into cost function.
C. Optimization
In traditional, the nonlinear least square problem of eq.3
is solved by Newton-Gaussian or Levenberg-Marquardt approaches. The cost function is linearized with respect to an
initial guess of states, X̂ . Then, the cost function is equals
to:
n X
X
2
arg min
ekt + Jkt δX Ωk ,
(6)
δX

D. Marginalization
Since the number of states increases along with time,
the computational complexity will increase quadratically
accordingly. In order to bound the computational complexity,
marginalization is incorporated without loss of useful information. Marginalization procedure converts previous measurements into a prior term, which reserves past information.
The set of states to be marginalized out is denoted as Xm ,
and the set of remaining states is denoted as Xr . By summing
all marginalized factors (eq.7), we get a new H and b. After
rearrange states’ order, we get the following relationship:


  
Hmm Hmr δXm
b
= m .
(8)
Hrm Hrr
δXr
br
The marginalization is carried out using the Schur complement [25] as follows:
−1

−1

(Hrr − Hrm Hmm Hmr ) δXr = br − Hrm Hmm bm .
{z
}
|
{z
}
|
bp

Hp

(9)
We get a new prior Hp , bp for the remaining states. The
information about marginalized states is converted into prior
term without any loss. To be specific, we keep ten spacial
camera frames in our system. When a new keyframe comes,
we marginalize out the visual and inertial factors, which are
related with states of the first frame.
After we get the prior information about current states,
with Bayes’ rule, we could calculate the posterior as a
product of likelihood and prior: p(X |z) ∝ p(z|X )p(X ).
The state estimation then becomes a MAP (Maximum A
Posteriori) problem. Denote that we keep states from instant
m to instant n in the sliding window. The states before m are
marginalized out and converted to a prior term. Therefore,
the MAP problem is written as:
n Y
Y
∗
Xm:n
= arg max
p(zkt |Xm:n ) p(Xm:n )
Xm:n

t=m k∈S

= arg min
Xm:n

n X
X

2

zkt − hkt (Xm:n ) Ωk

t=m k∈S

(10)

t

+ (Hp δXm:n − bp ).

t

t=0 k∈S

where J is the Jacobian matrix of each factor with respect to
current states X̂ . After linearization approximation, this cost
function has closed-form solution of δX . We take NewtonGaussian as example, the solution is derived as follows,
X X T −1
X X T −1
Jkt Ωkt Jkt δX = −
Jkt Ωkt ekt . (7)
|
{z
}
|
{z
}
H

solver [24] to solve this problem, which utilizes advanced
mathematical tools to get stable and optimal results efficiently.

b

Finally, current state X̂ is updated with X̂ ⊕ δX , where ⊕ is
the plus operation on manifold for rotation. This procedure
iterates several times until convergence. We adopt Ceres

Compared with eq.3, the above-mentioned equation only
adds a prior term. It is solved as same as eq.3 by Ceres
solver [24].
E. Discussion
The proposed system is a general framework. Various
sensors can be easily added into our system, as long as it
can be derived as a general residual factor. Since our system
is not specially designed for a certain sensor, it is capable
to handle sensor failure case. When sensor failure occurs,
we just remove factors of the inactive sensor and add new
factors from other alternative sensors.

TABLE I
RMSE[ M ] IN E U RO C DATASET.

Translation error [m]

Relative Pose Error in MH05
1
0.5
0
2

10

20

30

40

30

40

Rotation error [deg]

Distance [m]

4
3

Proposed (stereo)
Proposed (mono+imu)
Proposed (stereo+imu)
OKVIS (stereo+imu)

2
1

Sequence

Length

MH 01
MH 02
MH 03
MH 04
MH 05
V1 01
V1 02
V1 03
V2 01
V2 02
V2 03

79.84
72.75
130.58
91.55
97.32
58.51
75.72
78.77
36.34
83.01
85.23

stereo
0.54
0.46
0.33
0.78
0.50
0.55
0.23
x
0.23
0.20
x

Proposed RMSE
mono+imu
stereo+imu
0.18
0.24
0.09
0.18
0.17
0.23
0.21
0.39
0.25
0.19
0.06
0.10
0.09
0.10
0.18
0.11
0.06
0.12
0.11
0.10
0.26
0.27

OKVIS
RMSE
0.16
0.22
0.24
0.34
0.47
0.09
0.20
0.24
0.13
0.16
0.29

0
2

10

20

Distance [m]

Fig. 4. Relative pose error [26] in MH 05 difficult. Two plots are relative
errors in translation and rotation respectively.

Translation error [m]

Relative Pose Error in V202
0.8
0.6
0.4
0.2
0
2

10

20

30

40

30

40

Rotation error [deg]

Distance [m]
10

Proposed (stereo)
Proposed (mono+imu)
Proposed (stereo+imu)
OKVIS (stereo+imu)

5

0
2

10

20

Distance [m]

Fig. 5. Relative pose error [26] in V2 02 medium. Two plots are relative
errors in translation and rotation respectively.

V. E XPERIMENTAL R ESULTS
We evaluate the proposed system with visual and inertial
sensors both on datasets and with real-world experiments.
In the first experiment, we compare the proposed algorithm
with another state-of-the-art algorithm on public datasets. We
then test our system in the large-scale outdoor environment.
The numerical analysis is generated to show the accuracy of
our system in detail.
A. Datasets
We evaluate our proposed system using the EuRoC MAV
Visual-Inertial Datasets [27]. This datasets are collected
onboard a micro aerial vehicle, which contain stereo images
(Aptina MT9V034 global shutter, 752x480 monochrome, 20

FPS), synchronized IMU measurements (ADIS16448, 200
Hz), Also, the ground truth states are provided by VICON
and Leica MS50. We run datasets with three different combinations of sensors, which are stereo cameras, a monocular
camera with an IMU, stereo cameras with an IMU separately.
In this experiment, we compare our results with
OKVIS [8], a state-of-the-art VIO that works with stereo
cameras and an IMU. OKVIS is another optimization-based
sliding-window algorithm. OKVIS is specially designed for
visual-inertial sensors, while our system is a more general
framework, which supports multiple sensors combinations.
We tested the proposed framework and OKVIS with all
sequences in EuRoC datasets. We evaluated accuracy by
RPE (Relative Pose Errors) and ATE (Absolute Trajectory
Errors). The RPE is calculated by tools proposed in [26].
The RPE (Relative Pose Errors) plot of two sequences,
MH 05 difficult and V2 02 medium, are shown in Fig. 4 and
Fig. 5 respectively.
The RMSE (Root Mean Square Errors) of ATE for all
sequences in EuRoC datasets is shown in Table. I. Estimated
trajectories are aligned with the ground truth by Horn’s
method [28]. The stereo-only case fails in V1 03 difficult
and V2 03 difficult sequences, where the movement is too
aggressive for visual tracking to survive. Methods which
involves the IMU work successfully in all sequences. It
is a good case to show that the IMU can dramatically
improve motion tracking performance by bridging the gap
when visual tracks fail due to illumination change, textureless area, or motion blur.
From the relative pose error and absolute trajectory error,
we can see that the stereo-only method performed worst in
most sequences. Position and rotation drift obviously grown
along with distance in stereo-only case. In other words,
the IMU significantly benefited vision in states estimation.
Since the IMU measures gravity vector, it can effectively
suppress drifts in roll and pitch angles. Stereo cameras with
an IMU didn’t always perform best, because it requires more
accurate calibration than the case of a monocular camera with
an IMU. Inaccurate intrinsic and extrinsic calibration will
introduce more noise into the system. In general, multiple
sensor fusion increase the robustness of the system. Our
results outperforms OKVIS in most sequences.

Relative Pose Error in Outdoor Dataset
3

stereo
mono+imu
stereo+imu

Translation error [m]

2.5
2
1.5
1
0.5
0
2

20

30

40

50

Distance [m]

Fig. 8.

Fig. 6. The self-developed sensor suite used in the outdoor environment.
It contains stereo cameras (mvBlueFOX-MLC200w, 20Hz) and DJI A3
controller, which include inbuilt IMU (200Hz) and GPS receiver.

Outdoor Trajecotry
20

Proposed(stereo)
Proposed(mono+imu)
Proposed(stereo+imu)
GPS

15

10

y [m]

5

0

-5

-10

Relative pose error [26] in outdoor experiment.

around on the outdoor ground. We run states estimation with
three different combinations, which are stereo cameras, a
monocular camera with an IMU, and stereo cameras with an
IMU.
For accuracy comparison, we walked two circles on the
ground and compared our estimation with GPS. The trajectory is shown in Fig. 7, and the RPE (Relative Pose Error) is
shown in Fig. 8. As same as dataset experiment, noticeable
position drifts occurred in the stereo-only scenario. With the
assistance of the IMU, the accuracy improves a lot. The
RMSE of more outdoor experiments is shown in Table. II.
The method which involves the IMU always performs better
than the stereo-only case.
VI. C ONCLUSION

-15

-20
-5

0

5

10

15

20

25

30

x [m]

Fig. 7.

Estimated trajectories in outdoor experiment.

B. Real-world experiment
In this experiment, we used a self-developed sensor suite
to demonstrate our framework. The sensor suite is shown in
Fig. 6. It contains stereo cameras (mvBlueFOX-MLC200w,
20Hz) and DJI A3 controller2 , which inculdes inbuilt IMU
(200Hz) and GPS receiver. The GPS position is treated as
ground truth. We hold the sensor suite by hand and walk
2 http://www.dji.com/a3

TABLE II
RMSE[ M ] IN OUTDOOR EXPERIMENT.

Sequence

Length

outdoor1
outdoor2
outdoor3

223.70
229.91
232.13

stereo
1.85
2.35
2.59

Proposed RMSE
mono+imu
stereo+imu
0.71
0.52
0.56
0.43
0.65
0.75

In this paper we have presented a general optimizationbased framework for local pose estimation. The proposed
framework can support multiple sensor combinations, which
is desirable in aspect of robustness and practicability. We
further demonstrate it with visual and inertial sensors, which
form three sensor suites (stereo cameras, a monocular camera
with an IMU, and stereo cameras with an IMU). Note
that although we only show the factor formulations for the
camera and IMU, our framework can be generalized to
other sensors as well. We validate the performance of our
system with multiple sensors on both public datasets and
real-world experiments. The numerical result indicates that
our framework is able to fuse sensor data with different
settings.
In future work, we will extend our framework with global
sensors (e.g. GPS) to achieve locally accurate and globally
aware pose estimation.

R EFERENCES
[1] G. Klein and D. Murray, “Parallel tracking and mapping for small ar
workspaces,” in Mixed and Augmented Reality, 2007. IEEE and ACM
International Symposium on, 2007, pp. 225–234.
[2] C. Forster, M. Pizzoli, and D. Scaramuzza, “SVO: Fast semi-direct
monocular visual odometry,” in Proc. of the IEEE Int. Conf. on Robot.
and Autom., Hong Kong, China, May 2014.
[3] J. Engel, T. Schöps, and D. Cremers, “Lsd-slam: Large-scale direct monocular slam,” in European Conference on Computer Vision.
Springer International Publishing, 2014, pp. 834–849.
[4] R. Mur-Artal, J. Montiel, and J. D. Tardos, “Orb-slam: a versatile and
accurate monocular slam system,” IEEE Trans. Robot., vol. 31, no. 5,
pp. 1147–1163, 2015.
[5] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 2017.
[6] A. I. Mourikis and S. I. Roumeliotis, “A multi-state constraint Kalman
filter for vision-aided inertial navigation,” in Proc. of the IEEE Int.
Conf. on Robot. and Autom., Roma, Italy, Apr. 2007, pp. 3565–3572.
[7] M. Li and A. Mourikis, “High-precision, consistent EKF-based visualinertial odometry,” Int. J. Robot. Research, vol. 32, no. 6, pp. 690–711,
May 2013.
[8] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale,
“Keyframe-based visual-inertial odometry using nonlinear optimization,” Int. J. Robot. Research, vol. 34, no. 3, pp. 314–334, Mar. 2014.
[9] M. Bloesch, S. Omari, M. Hutter, and R. Siegwart, “Robust visual
inertial odometry using a direct ekf-based approach,” in Proc. of the
IEEE/RSJ Int. Conf. on Intell. Robots and Syst. IEEE, 2015, pp.
298–304.
[10] R. Mur-Artal and J. D. Tardós, “Visual-inertial monocular slam with
map reuse,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp.
796–803, 2017.
[11] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “On-manifold
preintegration for real-time visual-inertial odometry,” IEEE Trans.
Robot., vol. 33, no. 1, pp. 1–21, 2017.
[12] T. Qin, P. Li, and S. Shen, “Vins-mono: A robust and versatile
monocular visual-inertial state estimator,” IEEE Trans. Robot., vol. 34,
no. 4, pp. 1004–1020, 2018.
[13] J. Zhang and S. Singh, “Loam: Lidar odometry and mapping in realtime.” in Robotics: Science and Systems, vol. 2, 2014, p. 9.
[14] C. Kerl, J. Sturm, and D. Cremers, “Dense visual slam for rgb-d
cameras,” in Proc. of the IEEE/RSJ Int. Conf. on Intell. Robots and
Syst.
[15] H. Rebecq, T. Horstschaefer, G. Gallego, and D. Scaramuzza, “Evo: A
geometric approach to event-based 6-dof parallel tracking and mapping
in real time,” IEEE Robotics and Automation Letters, vol. 2, no. 2,
pp. 593–600, 2017.
[16] S. Lynen, M. W. Achtelik, S. Weiss, M. Chli, and R. Siegwart,
“A robust and modular multi-sensor fusion approach applied to mav
navigation,” in Proc. of the IEEE/RSJ Int. Conf. on Intell. Robots and
Syst. IEEE, 2013, pp. 3923–3929.
[17] G. P. Huang, A. I. Mourikis, and S. I. Roumeliotis, “Observabilitybased rules for designing consistent ekf slam estimators,” Int. J. Robot.
Research, vol. 29, no. 5, pp. 502–528, 2010.
[18] S. Shen, Y. Mulgaonkar, N. Michael, and V. Kumar, “Multi-sensor fusion for robust autonomous flight in indoor and outdoor environments
with a rotorcraft MAV,” in Proc. of the IEEE Int. Conf. on Robot. and
Autom., Hong Kong, China, May 2014, pp. 4974–4981.
[19] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard,
“g2o: A general framework for graph optimization,” in Proc. of the
IEEE Int. Conf. on Robot. and Autom. IEEE, 2011, pp. 3607–3613.
[20] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and
F. Dellaert, “isam2: Incremental smoothing and mapping using the
bayes tree,” Int. J. Robot. Research, vol. 31, no. 2, pp. 216–235, 2012.
[21] H. Liu, M. Chen, G. Zhang, H. Bao, and Y. Bao, “Ice-ba: Incremental,
consistent and efficient bundle adjustment for visual-inertial slam,” in
Proc. of the IEEE Int. Conf. on Pattern Recognition, 2018, pp. 1974–
1982.
[22] J. Shi and C. Tomasi, “Good features to track,” in Computer Vision
and Pattern Recognition,1994. IEEE Computer Society Conference on,
1994, pp. 593–600.
[23] B. D. Lucas and T. Kanade, “An iterative image registration technique
with an application to stereo vision,” in Proc. of the Intl. Joint Conf.
on Artificial Intelligence, Vancouver, Canada, Aug. 1981, pp. 24–28.
[24] S. Agarwal, K. Mierle, and Others, “Ceres solver,” http://ceres-solver.
org.

[25] G. Sibley, L. Matthies, and G. Sukhatme, “Sliding window filter with
application to planetary landing,” J. Field Robot., vol. 27, no. 5, pp.
587–608, Sep. 2010.
[26] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous
driving? the kitti vision benchmark suite,” in Proc. of the IEEE Int.
Conf. on Pattern Recognition, 2012, pp. 3354–3361.
[27] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W.
Achtelik, and R. Siegwart, “The euroc micro aerial vehicle datasets,”
Int. J. Robot. Research, 2016.
[28] B. K. Horn, “Closed-form solution of absolute orientation using unit
quaternions,” JOSA A, vol. 4, no. 4, pp. 629–642, 1987.

