Measuring mobile app performance in production

Gleb Tarasov
Booking.com Engineering
9 min read · Jan 11, 2024


A turtle is crawling on a mobile phone

“People using your app expect it to perform well. An app that takes a long time to launch, or responds slowly to input, may appear as if it isn’t working or is sluggish. An app that makes a lot of large network requests may increase the user’s data charges and drain the device battery. Any of these behaviors can frustrate users and lead them to uninstall the app.”

Apple Documentation

App performance is an integral part of the user experience. An app that’s prone to freezing or takes ages to launch won’t satisfy our customers. If the waiting time to load search results or the hotel details screen is too long, it could detract from the excitement of planning an upcoming vacation. This is something we would definitely prefer to avoid. However, every new feature can slightly degrade app performance, and certain changes might have a much greater impact, which can quickly get out of control.

The key to mitigating performance issues in mobile apps is proper monitoring; otherwise, any effort to improve or preserve performance would be flying blind.

A brief history of the App Performance team

At Booking.com we’ve been monitoring app performance metrics for quite some time. For instance, the first iOS startup time metric was introduced in 2016. Around 2019, a dedicated team responsible for monitoring and improving performance was created.

By 2021, the team realized that the existing setup for performance monitoring had become obsolete and unreliable and no longer fit our requirements, so it needed a revision.

While addressing the functional improvements in metrics, we also decided to completely rewrite the performance libraries, simultaneously transitioning from the older Objective-C/Java codebases to the modern Swift/Kotlin languages. Throughout this process, we designed our libraries to be fully independent of the rest of the Booking.com infrastructure, with external dependencies such as experimentation, storage, and networking injected.

Why not use existing third-party tools

There’s no shortage of free or paid tools to monitor app performance. Apple and Google offer some out-of-the-box monitoring solutions, and there are a few big third-party players such as Firebase Performance.

However, we had three primary requirements for our monitoring tool, all related to integrating with the Booking.com infrastructure and covering the specifics of our development culture:

  • Experimentation: Given the strong experimentation culture in the company, the most critical requirement was to ensure reliable and simple integration with our internal experimentation infrastructure.
  • Alerting: We require a flexible alerting/SLA system that can promptly notify different teams about performance degradations. Metrics should come from real devices as quickly as possible to ensure timely alerting.
  • Flexibility: We wanted the option to customize existing metrics defined by Apple & Google. For instance, on iOS, we aim to differentiate between app startup time and launch time of the first screen. This means we end startup time measurements sooner than Apple’s default monitoring system does.

Sadly, no third-party solution met even two of our three criteria, so we needed to implement our own.

What we measure

After recognizing the need for performance monitoring, we had to determine the metrics to track. Every metric should address a user pain point. We identified two primary user concerns: wait time and interface smoothness. This led us to focus on three primary metrics:

  • App startup time
  • Screen time to interactive (TTI)
  • Frame rendering performance

The core business of our company is to provide fast and reliable search and booking services. We have gathered enough data from real users to validate that significant degradation of these metrics often impacts conversion, and almost always negatively affects user engagement metrics.

App startup time

The App Startup Time metric measures the time in milliseconds (ms) from the user tapping the app icon on their Home screen until the app draws its first frame.

Startup time diagram

Both platforms also differentiate between “cold” and “warm” app starts, but our main focus was improving “cold” launches, when the system cannot benefit from app state previously cached in memory. “Warm” starts mostly depend on the performance of the specific screen opened first when a user returns to the app (and, as you will see in the next section, we measure that separately anyway).

More details and official recommendations regarding the app startup process on each platform can be found in the official developer documentation (iOS & Android).
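
To make the definition concrete, here is a minimal Swift sketch of one common way to measure cold start on iOS: the kernel-reported process start time stands in for the moment the user tapped the icon, and a block dispatched on the main queue after the window becomes visible approximates the first drawn frame. The helper names are illustrative, and this is an approximation rather than the exact mechanism our library uses.

```swift
import UIKit

/// Kernel-reported start time of the current process, queried via sysctl.
/// This approximates the moment the system spawned the app after the tap.
func processStartDate() -> Date? {
    var mib: [Int32] = [CTL_KERN, KERN_PROC, KERN_PROC_PID, getpid()]
    var info = kinfo_proc()
    var size = MemoryLayout<kinfo_proc>.stride
    guard sysctl(&mib, UInt32(mib.count), &info, &size, nil, 0) == 0 else {
        return nil
    }
    let start = info.kp_proc.p_un.__p_starttime
    let seconds = TimeInterval(start.tv_sec) + TimeInterval(start.tv_usec) / 1_000_000
    return Date(timeIntervalSince1970: seconds)
}

/// Call from application(_:didFinishLaunchingWithOptions:), after the window
/// is made key and visible; the next main-queue turn is a rough proxy for
/// "the first frame has been drawn".
func reportColdStart() {
    DispatchQueue.main.async {
        guard let start = processStartDate() else { return }
        let startupMs = Date().timeIntervalSince(start) * 1000
        print("Cold start: \(Int(startupMs)) ms") // send to analytics instead
    }
}
```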

Time to Interactive (TTI)

Time to Interactive (TTI) is the time in milliseconds (ms) spent from the start of the screen’s creation until the first frame of meaningful content is rendered, ensuring that:

  • All internal setup is complete
  • The UI is laid out and rendered
  • The most important data has arrived and is shown to the user
  • The main thread is ready to process incoming events

TTI always reflects native app performance, and for certain screens it may also depend on network or storage performance. This metric allows us to monitor how quickly customers can actually start using the screen for its main purpose.

This metric was initially defined by Google for web development, but we’ve found it very useful and perfectly suitable for mobile apps as well.
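
As an illustration of the definition, a screen could report its own TTI with a small helper like the hypothetical tracker below: the clock starts when the screen object is created and stops, via a hop through the main queue, once meaningful content is on screen. This is a sketch of the concept, not our library’s actual API.

```swift
import UIKit

/// Hypothetical tracker illustrating the TTI definition: the clock starts
/// when the screen object is created and stops once meaningful content is
/// shown and the main thread is free again.
final class TTITracker {
    private let screenName: String
    private let createdAt = CACurrentMediaTime()
    private var reported = false

    init(screenName: String) {
        self.screenName = screenName
    }

    /// Call when the most important data has arrived and is rendered.
    func reportContentShown() {
        guard !reported else { return }
        reported = true
        let name = screenName
        let created = createdAt
        // Hop through the main queue so the measurement also covers the
        // moment when the main thread is ready to process incoming events.
        DispatchQueue.main.async {
            let ttiMs = (CACurrentMediaTime() - created) * 1000
            print("TTI for \(name): \(Int(ttiMs)) ms") // report to analytics instead
        }
    }
}
```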

To investigate degradations in TTI, we need to understand the reasons behind them. For this purpose, we use supportive metrics. We monitor the wall-clock time and the latency of every network request related to screen loading, which helps us identify degradations caused by the backend. For screens that involve heavy read/write storage operations, it also makes sense to monitor storage performance separately.
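
For the network part, Foundation already collects detailed per-request timings. Here is a sketch of such a supportive metric using URLSessionTaskMetrics; the class name and reporting are illustrative:

```swift
import Foundation

/// Records per-request latency as a supportive metric for TTI analysis.
final class RequestMetricsCollector: NSObject, URLSessionTaskDelegate {
    func urlSession(_ session: URLSession,
                    task: URLSessionTask,
                    didFinishCollecting metrics: URLSessionTaskMetrics) {
        // Total wall-clock time of the task, measured by the system itself.
        let latencyMs = metrics.taskInterval.duration * 1000
        let url = task.originalRequest?.url?.absoluteString ?? "unknown"
        print("\(url): \(Int(latencyMs)) ms") // report to analytics instead
    }
}
```

Attaching it as the session delegate (e.g. `URLSession(configuration: .default, delegate: RequestMetricsCollector(), delegateQueue: nil)`) is enough to receive these callbacks for every task.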

Additionally, we use the Time To First Render metric to pinpoint degradations caused by screen creation and rendering (see the next section).

Time To First Render (TTFR)

Time To First Render (TTFR) is the time in milliseconds (ms) spent from the start of the screen’s creation until the screen renders its first frame.

It starts at the same time as the general TTI measurement but may stop earlier. In the most common case, the screen should be ready to be drawn as soon as possible, but it doesn’t necessarily have to show meaningful content immediately. Usually, the screen can show a progress indicator while doing heavy initialization in the background. We stop TTFR tracking once the very first frame is drawn, so the metric comes close to measuring the screen’s creation time. Monitoring it helps us keep the UI thread from freezing during screen creation, which leads to a better user experience.

This metric directly impacts TTI and may impact Rendering Performance as well.

TTI diagram
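
A TTFR measurement can be sketched with a CADisplayLink: start the clock in the screen’s initializer and stop on the first display-link tick after the view appears, which closely approximates the first rendered frame. Again, this is an illustration rather than the exact library mechanism.

```swift
import UIKit

/// Illustrative TTFR tracker: the clock starts when the screen object is
/// created and stops on the first CADisplayLink tick after the view
/// appears — an approximation of the first rendered frame.
final class TTFRTracker: NSObject {
    private let createdAt = CACurrentMediaTime()
    private var link: CADisplayLink?
    private let onResult: (Double) -> Void

    init(onResult: @escaping (Double) -> Void) {
        self.onResult = onResult
        super.init()
    }

    /// Call from viewWillAppear (UIKit) or onAppear (SwiftUI).
    func waitForFirstFrame() {
        guard link == nil else { return }
        link = CADisplayLink(target: self, selector: #selector(firstFrame))
        link?.add(to: .main, forMode: .common)
    }

    @objc private func firstFrame() {
        link?.invalidate()
        link = nil
        onResult((CACurrentMediaTime() - createdAt) * 1000) // TTFR in ms
    }
}
```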

Rendering performance

To ensure that a user’s interaction with an app is smooth, the app should render frames in under 16 ms to achieve 60 frames per second (note: on many modern devices, the target might be set to 90 or 120 fps due to higher display frame rates, but we will refer to 60 fps in this article). If the frame rendering time exceeds 16 ms, then the system is forced to skip frames and the user will perceive stuttering in the app.

Let’s try to visualize how the app renders frames on the timeline:

Freezes duration diagram

There are two main factors that determine how bad rendering performance might be:

  • Freeze duration: How long it takes to render a single slow frame. The longer it takes, the more noticeable the problem is (e.g. a frame that renders for 500ms looks much worse than one that renders for 32ms).
  • Freeze count, or freeze frequency: How many slow frames users experience, and how often. The more often users hit freezes of the same duration, the worse the experience.

Considering these two factors, we can define a metric that represents rendering performance quite accurately:

  • Freeze Time: The total time of UI freezing due to the rendering of slow frames per screen (or app) session.

In the illustration above we see six frames: three good frames rendered within 16ms, and three frames with freezes of different durations. We calculate freeze duration as the difference between the actual frame duration and the 16ms target frame duration. To calculate total freeze time, we sum the durations of all freezes that happen on the screen.
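
In code, this accounting boils down to comparing consecutive frame timestamps against the frame budget. Below is a sketch using CADisplayLink, assuming a 60 fps target (≈16.7ms per frame); it also accumulates the Freeze Count metric discussed below. The real implementation differs in details such as thresholds and handling of higher display frame rates.

```swift
import QuartzCore

/// Sketch of Freeze Time accounting: every frame interval longer than the
/// target contributes its excess to the total freeze time.
final class FreezeTimeCollector: NSObject {
    private let targetFrameDuration: CFTimeInterval = 1.0 / 60.0
    private var lastFrameTimestamp: CFTimeInterval?
    private(set) var freezeTimeMs: Double = 0
    private(set) var freezeCount = 0
    private var link: CADisplayLink?

    /// Start when the screen session begins.
    func start() {
        link = CADisplayLink(target: self, selector: #selector(tick(_:)))
        link?.add(to: .main, forMode: .common)
    }

    /// Stop when the screen session ends, then report both metrics.
    func stop() {
        link?.invalidate()
        link = nil
    }

    @objc private func tick(_ link: CADisplayLink) {
        defer { lastFrameTimestamp = link.timestamp }
        guard let last = lastFrameTimestamp else { return }
        // Freeze duration = actual frame duration minus the target duration.
        let freeze = (link.timestamp - last) - targetFrameDuration
        if freeze > 0 {
            freezeTimeMs += freeze * 1000
            freezeCount += 1
        }
    }
}
```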

The same Freeze Time can come from very different patterns: one 1000ms freeze, or a hundred 10ms freezes. Freeze time can also increase without any code change, simply because sessions get longer (e.g. when every item of a scrollable list generates slow frames, more scrolling leads to a higher total freeze time).

To catch such situations, we also use two additional metrics:

  • Freeze Count: The total count of slow frames (slower than 16ms) during a screen session, which shows whether the pattern of freezes has changed.
  • Session Duration: The duration of the screen session, which shows whether a change in freeze time is driven by a change in session length.

Rationale for choosing the Freeze Time metric

Both Google and Apple offer metrics for assessing rendering performance. Initially, we adopted a method implemented by Firebase for our rendering performance monitoring, which involved tracking slow frames (>16ms to render) and frozen frames (>700ms). However, we discovered that these metrics did not adequately capture rendering performance degradation.

For instance, consider a scenario where views in a list are already slow, requiring 20ms for rendering. If the rendering time increases to 300ms, the metrics would still report one slow frame per view without any frozen frames, failing to indicate a significant deterioration in rendering time.

Moreover, there is an inconsistency in how performance changes are reflected. A view’s rendering time increasing from 15ms to 20ms is recorded as the same metric change as an increase from 15ms to 300ms, which does not accurately represent the severity of the slowdown.

Apple’s “Hang rate” metric, which is calculated as seconds of hang time per hour, appeared to be more in line with what we needed. It resembles our Freeze Time metric but is normalized by dividing the total freeze time by the session duration. However, this normalization caused the metric to become overly sensitive to changes in user behavior.

For instance, if a product feature causes users to spend more time scrolling through a slow list, the Hang rate may show an improvement because the session duration increased, even though the user experience has degraded due to more freezes.

After encountering various scenarios where the relative metric did not provide a clear picture of performance, we decided to use an absolute metric instead. This allows us to measure rendering performance more accurately, not just for the entire application but for each screen session, without the results being skewed by user behavior or session length.

The absolute metric has certain limitations too. Take the same example: if a product feature results in users scrolling through a slow list more frequently, the rendering metric worsens even though there has been no technical decline in performance. However, the supplementary Session Duration metric allows us to manage these situations effectively.

The main idea is that we consider any increase in Freeze Time a negative performance change, regardless of the reason (though ideally, the user shouldn’t see any freezes at all). It is important to react to new performance issues caused by a new feature, but it is just as important to detect an old screen producing more freezes because users have started interacting with it more actively.

Show me the code!

PerformanceSuite logo: a mix of a turtle and a swift

Wrapping up all the theory and metric definitions, we could finally implement a working solution for collecting these metrics. We’ve recently open-sourced our performance-tracking libraries for both platforms, which you can find on GitHub.

These lightweight libraries gather the aforementioned metrics and allow you to report them to any analytics system. We are actively working on further improvements and new features (e.g. termination monitoring), but the libraries are already used in the main Booking.com app on both platforms. The iOS library is also used in Pulse, our app for property owners, and in the Agoda app from our sister company.

Feel free to try it out, leave feedback, or even better, contribute!

Vadim Chepovsky and Gleb Tarasov
