
2 posts tagged with "ANR debugging"


How to Detect and Debug ANRs That Only Appear in Production on Low-Memory Android Devices

Published: · 7 min read
Sandra Rosa Antony
Software Engineer, Appxiom

When a critical user action triggers a complete UI freeze and Android displays the “App Not Responding” (ANR) dialog, production dashboards may log thousands of affected sessions, yet attempts to reproduce the issue on local emulators or recent test devices fail. Inspection of the affected production devices shows they predominantly have ≤2 GB RAM and run Android versions with aggressive low-memory management. Standard QA and staging are unable to surface the freeze, leaving engineers with only anonymized stack traces from Play Console and no actionable repro steps.

ANRs on Low-Memory Devices: Manifestations and Misconceptions

ANRs are triggered when an app’s main thread fails to respond to an input event within 5 seconds, or when a component callback such as a BroadcastReceiver or Service exceeds its system timeout. On low-memory (or “low-RAM”) Android devices, ANR rates are disproportionately higher. These devices exhibit system-wide memory pressure, causing frequent background process kills, rapid garbage collection cycles, and unpredictable reclaim behavior. A common misconception is that resource bottlenecks only manifest as OOM (Out Of Memory) crashes, but in practice, sustained memory thrashing can starve the main thread, delaying message dispatch and causing downstream lock-ups ending in ANRs.

Engineers often discover, through logs, that problematic sessions correlate with lower available RAM and aggressive background process culling (ActivityManager.isLowRamDevice() returns true). In this environment, even fast, local memory allocations can trigger system-induced stalls.

Real World Signal: Interpreting Production ANR Reports

Play Console aggregates ANR data but only surfaces stack traces for the moment of the freeze - not the full causal chain. Typical traces show the main thread stuck on wait conditions, disk I/O, or long-running JNI calls, but provide little situational context:

"main" prio=5 tid=1 Native
| group="main" sCount=1 dsCount=0...
at com.example.app.util.ImageCacheLoader.decodeImage(ImageCacheLoader.java:92)
...
at android.os.Looper.loop(Looper.java:163)
at android.app.ActivityThread.main(ActivityThread.java:6349)

This is insufficient to reconstruct the memory conditions, heap state, or GC behavior that led up to the freeze. The trace is captured only after the timeout has expired, and it reflects only the stuck thread, not the systemic context at the time. To make these main-thread stack traces actionable, engineers need to correlate them with system-level metrics (available memory, background GC, process lifetime).
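One way to approach that correlation on the backend is to join each ANR report with the memory samples recorded shortly before it in the same session. The sketch below is only an illustration: `MemSample`, `AnrEvent`, their fields, and the 30-second window are assumptions made for this example, not part of any Play Console export format.

```kotlin
// Hypothetical data model for illustration; field names are assumptions.
data class MemSample(
    val sessionId: String,
    val timestampMs: Long,
    val availMemBytes: Long,
    val lowMemory: Boolean,
)

data class AnrEvent(val sessionId: String, val timestampMs: Long, val topFrame: String)

/**
 * Returns, for each ANR, the samples from the same session that were
 * recorded within [windowMs] before the ANR, oldest first.
 */
fun samplesBeforeAnr(
    anrs: List<AnrEvent>,
    samples: List<MemSample>,
    windowMs: Long = 30_000,
): Map<AnrEvent, List<MemSample>> {
    val bySession = samples.groupBy { it.sessionId }
    return anrs.associateWith { anr ->
        bySession[anr.sessionId].orEmpty()
            .filter { it.timestampMs in (anr.timestampMs - windowMs)..anr.timestampMs }
            .sortedBy { it.timestampMs }
    }
}
```

Reading the resulting lists oldest-first makes trends such as steadily falling `availMemBytes` or a flip of `lowMemory` to true easy to spot next to the stack trace.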

Gathering Context Remotely: Traces, Metrics, and Proactive Signals

To bridge diagnostic gaps in production, advanced teams employ a mix of remote tracing, custom metric reporting, and log enrichment. Integrating a lightweight remote logging library that captures:

  • Native heap free/total size via Debug.getNativeHeapFreeSize() and Debug.getNativeHeapSize()
  • GC count via Debug.getRuntimeStat("art.gc.gc-count") (API 23+; Debug.getGlobalGcInvocationCount() is deprecated)
  • Per-thread CPU/IO usage via /proc/self/task stats
  • System memory state via ActivityManager.MemoryInfo, plus the per-app memory class via ActivityManager.getMemoryClass()

enables engineers to reconstruct the environment leading to ANRs. For high signal, these samples should be recorded not just on fatal signals, but regularly (with throttling to avoid performance overhead) and tagged with session IDs.
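The "regularly, but throttled" part can be sketched as a small sampler that refuses to record more often than a minimum interval. This is a minimal sketch under stated assumptions: `HeapSample` and `ThrottledHeapSampler` are hypothetical names, the sample only reads `Runtime` (Android-specific values such as `ActivityManager.MemoryInfo` would be added in a real app), and `sink` stands in for whatever remote logger is used.

```kotlin
// Sketch of a throttled heap sampler; class and field names are hypothetical.
data class HeapSample(
    val sessionId: String,
    val timestampMs: Long,
    val usedBytes: Long,
    val maxBytes: Long,
)

class ThrottledHeapSampler(
    private val sessionId: String,
    private val minIntervalMs: Long = 5_000,
    private val clock: () -> Long = System::currentTimeMillis,
    private val sink: (HeapSample) -> Unit,
) {
    private var lastSampleMs: Long? = null

    /** Records a sample unless one was taken less than [minIntervalMs] ago. Returns true if recorded. */
    @Synchronized
    fun maybeSample(): Boolean {
        val now = clock()
        val last = lastSampleMs
        if (last != null && now - last < minIntervalMs) return false
        lastSampleMs = now
        val rt = Runtime.getRuntime()
        sink(HeapSample(sessionId, now, rt.totalMemory() - rt.freeMemory(), rt.maxMemory()))
        return true
    }
}
```

Calling `maybeSample()` from frequent hooks (activity start, screen transitions) is then safe, because the throttle rather than the call site bounds the overhead.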

Example of custom log event on each activity start:

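import android.app.ActivityManager
import android.content.Context
import android.util.Log

// Assumes `context` is an Activity or Application context.
val activityManager = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
val runtime = Runtime.getRuntime()
val memInfo = ActivityManager.MemoryInfo()
activityManager.getMemoryInfo(memInfo)
// memoryClass is a property of ActivityManager, not of MemoryInfo.
Log.i("MemSignal", "freeMemory=${runtime.freeMemory()} totalMemory=${runtime.totalMemory()} " +
    "availMem=${memInfo.availMem} lowMemory=${memInfo.lowMemory} memoryClass=${activityManager.memoryClass}")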

When the backend links these logs to users who report freezes, patterns begin to emerge - a declining heap, multiple forced GCs, or coincident large bitmap decodes preceding the freeze.

Simulating Memory Pressure: Reproducibility Limitations and Emulation Gaps

Simply running apps on typical emulators or recent flagship phones misses many production conditions. Android’s emulator (“AVD”) lets you configure device RAM and VM heap size, but it doesn’t reliably model every aspect of low-RAM device scheduling, cgroup memory restrictions, or system-initiated background process termination. Engineers need to push beyond standard tools.

Two effective strategies:

  1. Manual Memory Pressure: Use a debug-only helper that allocates and retains large buffers to shrink the available heap during testing, observing at what point UI tasks begin to starve.
  2. ‘kill-all’ Background/Foreground Cycling: Utilize adb shell am kill-all and frequent task-switching to force the app through repeated lifecycle events. Low-memory devices often trigger cleanup and process recreation side effects not seen elsewhere.

While they do not perfectly match production, these methods surface code paths and resource use patterns that hang in low-resource situations.
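For the manual memory pressure strategy, a debug-only helper along these lines can hold heap hostage while you exercise the UI. It is a rough sketch, not a faithful model of system-wide pressure: `HeapPressure` is a hypothetical name, it only consumes the app's own managed heap, and the 1 MiB default chunk size is arbitrary.

```kotlin
// Debug-only sketch: retains byte arrays to shrink the free heap on demand.
// Names are hypothetical; do not ship this in release builds.
class HeapPressure(private val chunkBytes: Int = 1 shl 20) {
    private val retained = ArrayList<ByteArray>()

    /** Allocates up to [chunks] buffers, stopping early on OutOfMemoryError. Returns how many were kept. */
    fun allocate(chunks: Int): Int {
        var allocated = 0
        try {
            repeat(chunks) {
                val buf = ByteArray(chunkBytes)
                // Touch the buffer at 4 KiB strides so the memory is actually committed.
                for (i in buf.indices step 4096) buf[i] = 1
                retained.add(buf)
                allocated++
            }
        } catch (e: OutOfMemoryError) {
            // Heap exhausted: keep what was allocated so the pressure persists.
        }
        return allocated
    }

    /** Total number of bytes currently held. */
    val retainedBytes: Long get() = retained.sumOf { it.size.toLong() }

    /** Drops all buffers so a later GC can reclaim them. */
    fun release() = retained.clear()
}
```

Stepping `allocate()` up in small increments while repeating a suspect flow gives a rough reading of how much headroom the flow needs before it starts to stall.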

Targeted Fixes: Engineering for Responsiveness Under Pressure

Profiling often identifies expensive on-demand resource allocation (e.g., bitmap decoding, large JSON parsing) on the main thread as core offenders. However, on low-memory systems, even “background” async work can trigger system GC or paging that indirectly blocks the main thread, due to shared allocator locks inside ART or the Linux kernel.

Key technical mitigations:

  • Move Large Allocations Off Main Thread: Verify all allocation-heavy operations are confined to thread or coroutine pools. Even lazy initialization routines must be re-examined for hidden main-thread coupling.
  • Detect and Throttle Heap Pressure: Employ a watchdog that rejects or defers work when heap headroom, computed as maxMemory() - (totalMemory() - freeMemory()), drops below a threshold; gracefully degrade optional features or image resolutions.
  • Cache More Aggressively, But Lazily: Preload - rather than re-allocate - critical objects during application idle time or at explicit user interaction boundaries.
  • Explicitly Listen for Low-Memory Signals: Implement ComponentCallbacks2.onTrimMemory() to react to TRIM_MEMORY_RUNNING_CRITICAL events:
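override fun onTrimMemory(level: Int) {
    // Call through when overriding in an Activity, Fragment, or Application.
    super.onTrimMemory(level)
    if (level >= ComponentCallbacks2.TRIM_MEMORY_RUNNING_CRITICAL) {
        cache.clearNonEssential()           // app-specific cache (assumed)
        jobQueue.prioritizeUrgentWorkOnly() // app-specific job queue (assumed)
    }
}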

Engineers must validate that clean-up routines triggered by memory pressure (such as image caches, pools, and job queues) don’t internally trigger main-thread stalls or deadlocks.
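The heap-pressure watchdog idea can be sketched as a pure decision function. Note that `Runtime.freeMemory()` alone only reports free space inside the currently committed heap, so this sketch measures used heap against `maxMemory()` instead. The `WorkDecision` enum and the 0.75 / 0.90 thresholds are illustrative assumptions that would need tuning against production data.

```kotlin
/** Fraction of the maximum heap currently in use, in the range 0.0..1.0. */
fun heapUsedFraction(totalBytes: Long, freeBytes: Long, maxBytes: Long): Double =
    (totalBytes - freeBytes).toDouble() / maxBytes.toDouble()

enum class WorkDecision { RUN, DEGRADE, DEFER }

/**
 * Hypothetical policy: optional work is degraded early and deferred under
 * severe pressure; required work is only degraded under severe pressure.
 */
fun decide(usedFraction: Double, optional: Boolean): WorkDecision = when {
    usedFraction >= 0.90 -> if (optional) WorkDecision.DEFER else WorkDecision.DEGRADE
    usedFraction >= 0.75 -> if (optional) WorkDecision.DEGRADE else WorkDecision.RUN
    else -> WorkDecision.RUN
}

/** Applies the policy to the live runtime heap figures. */
fun currentDecision(optional: Boolean): WorkDecision {
    val rt = Runtime.getRuntime()
    return decide(heapUsedFraction(rt.totalMemory(), rt.freeMemory(), rt.maxMemory()), optional)
}
```

Keeping `decide` free of runtime calls makes the policy easy to unit test and to adjust per device class without touching call sites.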

Connecting Diagnostics: Metrics, Logs, and Traces to Guide Fixes

A robust ANR debugging workflow depends on correlating runtime metrics, traces, and user activity leading up to the freeze window. Heap state, GC frequency, thread contention, and device-level memory pressure all help explain why an ANR occurred, but production debugging also requires visibility into when the freeze begins and what the user was doing immediately before it happened.

Appxiom’s ANR monitoring improves this visibility by detecting and reporting ANRs immediately when the UI thread becomes unresponsive, even before Android displays the system-level “App Not Responding” dialog to the user. This early detection helps engineering teams capture runtime state closer to the actual stall point instead of relying only on delayed system reports or post-mortem Play Console traces.

If the user force-closes the application after the ANR dialog appears, Appxiom raises a separate issue ticket reflecting the severity escalation. This distinction is useful operationally because it separates recoverable UI stalls from sessions where users explicitly abandon the app due to prolonged unresponsiveness.

In addition to ANR detection, Appxiom's Activity Trail feature helps reconstruct the execution path leading up to the freeze. Developers can manually mark important execution points, user actions, or high-risk operations inside critical flows such as image decoding, database access, subscription processing, or navigation transitions.

Example activity markers:

Ax.markActivity("subscription_checkout_started")

Ax.markActivity("fetching_entitlements")

Ax.markActivity("premium_dashboard_render")

These markers appear alongside ANR traces and runtime diagnostics, making it easier to correlate freezes with specific user actions or application states. Instead of analyzing isolated stack traces, engineers gain a chronological activity trail showing what occurred immediately before the UI became unresponsive.

Combined with runtime memory metrics, heap monitoring, and thread diagnostics, this creates a more actionable debugging workflow for production-only ANRs on low-memory devices. Teams can identify whether freezes correlate with bitmap allocation spikes, entitlement synchronization, disk I/O, excessive GC activity, or lifecycle transitions under memory pressure.

Trade-offs and Limitations

Despite intensive profiling and app-level patching, engineers must accept several realities:

  • Kernel and System Constraints: On very low-end hardware, system schedulers and kill policies can cause freezes independent of app logic.
  • Privacy and Overhead: Remote log and trace capture is limited by performance and privacy constraints; anonymization and sampling are essential.
  • Partial Observability: Some freezes are artifacts of vendor-specific ROMs or OS bugs beyond the app’s corrective scope.

The best strategy combines fixing known memory leaks, controlled feature degradation under memory pressure, and tight operational feedback loops.

Conclusion: Systematic Approach for Real-World Stability

Low-memory device ANRs surface only in production due to a complex interplay of system memory management, app-level resource use, and user-specific device histories. Detection and debugging require collection of targeted runtime metrics, simulated memory scenarios, and incremental, measured improvements. By connecting production traces to actionable device state and actively engineering for resilience under pressure, teams can meaningfully drive down ANR rates and improve app responsiveness across the device spectrum.

Using Android Vitals Metrics to Predict and Prevent Application Not Responding (ANR) Events

Published: · 6 min read
Appxiom Team
Mobile App Performance Experts

The Subtle Onset of an App-Numbing Outage

It usually begins as a faint uptick - a few ANR entries trickling into your Play Console. Dismissed initially as the cost of doing business ("There's always a background process hiccup, right?"), that number swells. By the next release, what was once an edge case now plots as a trend: churned users citing frozen screens and unresponsive tabs, and a rising uninstall rate.

These moments, for a senior Android engineer, are never just about chasing an elusive stack trace. They’re lessons in understanding - the difference between reading numbers and reading what the numbers reveal about your systemic weaknesses.

From Metrics to Meaning: What Android Vitals Is Telling You

A mistake many teams make is treating Android Vitals as a passive dashboard - something to be checked post-mortem. But, in reality, Vitals is a living telemetry stream, a mirror for app health at scale. Each ANR metric is woven out of user experience: main thread stalls, excessive broadcast receiver work, read/write blocks.

Consider this excerpt from a Play Console telemetry snapshot:

ANR rate: 0.57% (90th percentile)
Highest correlation: BackgroundService Execution Time (p95: 6.2s)
Other signals: InputDispatching Timeout, ForegroundLaunch Delays

At first, the temptation is to dive straight into the most frequent offender in your logs. But this pulls you into a whack-a-mole game. Instead, experienced engineers look for patterns. For example:

  • Do ANRs cluster on particular device models, OS versions, or network conditions?
  • Are spikes correlated with long I/O traces on the main thread?
  • Is there a recurring background service or broadcast coinciding with user-initiated freezes?

The art is shifting from asking "Where did things go wrong?" to "What systemic stressors are manifesting in these metrics?"

A Real-World Failure: The Invisible Slowdown

Let’s ground this: Suppose, during a peak release, user complaints cite “tapping buttons does nothing,” but crash logs are oddly silent. You pull Android Vitals and find a spike in input dispatching timeout ANRs. The device logs show entries like:

com.example.app ANR in com.example.app
Reason: Input dispatching timed out (Activity com.example.app.MainActivity)
Load: 1.25 / 1.09 / 1.00
CPU usage: 74% (user 52%, system 22%)

There’s no null pointer or crash - just a main thread suffocating, often because an innocent UI event triggered a heavy database migration or a sync operation on the UI thread.

The root cause? A subtle misconception: "If it’s a quick DB read, it’s fine on the main thread." Until, of course, it isn't - on slower devices or busy CPU cycles, that “quick” read can easily breach the 5-second input timeout.

The fix isn't just in refactoring that specific query off the main thread, but in systematizing a rule: All I/O, all DB reads, disk writes, and network checks should be main-thread forbidden, enforced via static analysis (like Android Lint rules) and with real-world spot checks using traces.

Beyond Symptoms: Proactive ANR Forecasting

ANRs are notoriously reactive: once they’re happening, user harm is done. The real challenge is investing in predictive signals.

A practical strategy: leverage the combination of Vitals percentile metrics and custom telemetry to catch suspects before the ANR threshold. For instance, by instrumenting key latency points:

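import android.os.SystemClock
import com.google.firebase.perf.FirebasePerformance

val trace = FirebasePerformance.getInstance().newTrace("heavy_operation")
trace.start()
val start = SystemClock.elapsedRealtime()
val result = doNetworkOrDiskOperation() // placeholder for the operation being measured
val duration = SystemClock.elapsedRealtime() - start

// Flag slow runs (> 200 ms) with a custom metric, then close the trace.
if (duration > 200) {
    trace.putMetric("slow_duration_ms", duration)
}
trace.stop()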

Now, correlate these custom metrics with Play Console’s “Slow rendering” or “Cold start” warnings. When you see rising tail latencies edging closer to ANR cutoffs (e.g., routine ops flirting with >4s), you have both macro-signals (Vitals) and micro-insights (bespoke metrics) to target.

Trade-off: Instrumentation adds some overhead and telemetry bloat, so target high-risk paths - not every single method.

Pitfalls of Focusing Solely on the Stack Trace

It's a rite of passage to over-index on the ANR stack traces Android provides:

"main" prio=5 tid=1 Native
| group="main" sCount=1 dsCount=0 obj=0x746f9bd0 self=0x7f8e21c000
| sysTid=13461 nice=-10 cgrp=default sched=0/0 handle=0x7f9871d4f8
at java.lang.Thread.sleep(Native Method)
at com.example.app.util.SyncHelper$job$1.run(SyncHelper.kt:42)

But the stack trace is less a cause, more a snapshot - a Polaroid of catastrophe at its peak. Deep problems - like resource contention, lock inversions, or dogpiled async work - unfold over seconds and aren't always represented here.

Smart teams use traces as starting points, but synthesize with:

  • System traces: Systrace or Perfetto traces reveal whether the main thread is starved for CPU by background hogs (e.g., a foreground service spiking CPU).
  • ANR clustering: Are these traces frequent only on low-memory devices? Only after certain user flows?

Holistic ANR prevention comes from framing stack traces as symptoms within a broader system signature.
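The clustering question can be answered with a simple group-and-count over exported ANR reports. The following is a sketch under assumptions: `AnrReport` and its fields are a hypothetical shape for your own telemetry, and the RAM bucket boundaries are arbitrary.

```kotlin
// Hypothetical report shape; field names are assumptions for illustration.
data class AnrReport(val deviceRamMb: Int, val lastUserFlow: String, val topFrame: String)

/** Maps total device RAM to a coarse bucket label. */
fun ramBucket(ramMb: Int): String = when {
    ramMb <= 2048 -> "<=2GB"
    ramMb <= 4096 -> "2-4GB"
    else -> ">4GB"
}

/** Counts ANRs per (RAM bucket, user flow) pair, most frequent first. */
fun clusterAnrs(reports: List<AnrReport>): List<Pair<Pair<String, String>, Int>> =
    reports.groupingBy { ramBucket(it.deviceRamMb) to it.lastUserFlow }
        .eachCount()
        .toList()
        .sortedByDescending { it.second }
```

If one bucket and flow pair dominates the top of the list, that is a stronger lead than any single stack trace in the cluster.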

Strategies in Production: Mitigations and Feedback Loops

Let’s reimagine response not as a one-time fix, but as a virtuous feedback cycle.

1. Instrument and Alert: Inject custom latency metrics at high-risk operations (I/O, startup path, navigation transitions), aggregating to your observability platform. Set up alerts when operations flirt with your threshold, even if no ANR yet occurs.

2. Vitals-Driven Release Gates: Institute Play Console metrics as a release blocker - e.g., block rolling out to 100% if ANR rate breaches 0.5% in staggered rollouts.

3. Real User Monitoring: For large user bases, some behaviors can only be seen at scale. Integrate tools like Firebase Performance or Appxiom UX to overlay user session data and see the contextual triggers that diagnostics miss.
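The release gate in step 2 can be expressed as a small function that a rollout script calls before widening a staged release. This is a sketch only: the session counts are assumed to come from your own metrics export, `RolloutAction` is a hypothetical type, the 0.5% limit mirrors the example in step 2, and the minimum sample size is an arbitrary guard against deciding on noise.

```kotlin
enum class RolloutAction { PROCEED, HOLD, HALT }

/**
 * Hypothetical gate for a staged rollout.
 * HOLD: not enough sessions yet to judge. HALT: ANR rate above the limit.
 */
fun rolloutGate(
    anrSessions: Long,
    totalSessions: Long,
    maxAnrRate: Double = 0.005,
    minSessions: Long = 10_000,
): RolloutAction {
    if (totalSessions < minSessions) return RolloutAction.HOLD
    val rate = anrSessions.toDouble() / totalSessions
    return if (rate > maxAnrRate) RolloutAction.HALT else RolloutAction.PROCEED
}
```

Separating HOLD from HALT keeps a low-traffic early stage from being mistaken for either a healthy or a failing release.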

Connecting the Dots: System Signals You Should Be Watching

It’s tempting to rely solely on crash- or ANR-specific signals - but application responsiveness is a living, interdependent system.

What to watch:

  • ANR Rate (in Play Console): Overall health indicator
  • Slow Rendering/Startup > 5s: Early predictors of trouble brewing
  • RAM Usage and GC Spikes: Persistent memory churn raises the risk of stalls
  • Custom Async Operation Latency: Surface operations risking main thread waits

And crucially: connect these via dashboards - e.g., overlay ANR rate with percentile latencies from your own telemetry.

Example composite graph:

| Time        | ANR Rate | P95 I/O Latency | GC Pause/Min | Slow Startup Rate |
|-------------|----------|-----------------|--------------|-------------------|
| 09:00-10:00 | 0.28%    | 900ms           | 180ms        | 4.2%              |
| 10:00-11:00 | 0.61%    | 4,130ms         | 410ms        | 13.7%             |

Notice that as P95 latency climbs, so does ANR rate - the canary singing long before disaster.
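To turn a P95 column like this into an early warning, the percentile has to be computed from raw latency samples and compared against the ANR cutoff. The sketch below uses the nearest-rank method; the function names and the choice to warn at 80% of the 5-second input timeout are illustrative assumptions.

```kotlin
import kotlin.math.ceil

/** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
fun percentile(valuesMs: List<Long>, p: Double): Long {
    require(valuesMs.isNotEmpty()) { "no samples" }
    require(p > 0.0 && p <= 100.0) { "p must be in (0, 100]" }
    val sorted = valuesMs.sorted()
    val rank = ceil(p * sorted.size / 100.0).toInt()
    return sorted[rank - 1]
}

/** True when P95 latency exceeds [fraction] of the ANR timeout (5 s for input dispatch). */
fun nearAnrThreshold(
    valuesMs: List<Long>,
    anrTimeoutMs: Long = 5_000,
    fraction: Double = 0.8,
): Boolean = percentile(valuesMs, 95.0) > anrTimeoutMs * fraction
```

With these defaults, a window whose P95 is 900 ms stays quiet, while one at 4,130 ms crosses the 4,000 ms warning line.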

Evolving from Fixes to Resilience

What transforms a team from firefighting ANRs to engineering resilience? It’s the shift to thinking in terms of lead indicators. Vitals offers the forest; traces and custom telemetry map the trees.

Mitigation flows from proactive usage: blocking synchronous I/O, abuse-proofing background work, and making Play Console ANR stats as central to your workflow as CI tests. Even the best code reviews miss concurrency bugs that only real users expose at scale.

Every ANR investigated is both a post-mortem and a guide - if you let the system’s metrics teach you. The payoff isn’t just green dashboards, but apps that feel snappy and trustworthy to millions - because you learned to listen before they started to freeze.