locked
App locks up (no watchdog timeout) when running on Android 6 using Xamarin 4+ RRS feed

  • Question

  • User139501 posted

    I work with a pretty large enterprise application (around 800k LOC) and a release APK of 65MB. We've only really worked with Android 4.4 KitKat and 5 Lollipop. Recently one of our Samsung devices updated to Android 6 Marshmallow and started exhibiting some freeze up issues. Basically our app runs in Kiosk mode, so we are the home launcher. The user doesn't see or access the android OS. Our app stays running and on throughout the day for our users, so we force the screen to stay on, use wakelocks, etc. Basically when our app is running, after a period of time ranging from 20 minutes to a day, the app will just freeze. The UI will be frozen and our process stops doing work. We also run in two processes, one being a service. The service will keep running, but only process a small amount of data as the two processes are dependent on each other.

    What's interesting is that when building our app using Xamarin 3.9, the lock up issue does not occur running the exact same code, same devices. Keep in mind, Xamarin 3.9 was PRE Android 6 Marshmallow. We've got this to reproduce on multiple Android 6 devices so far.

    Our app does a lot of logging, and I've spent a lot of time in the Logcat as well, basically living in the Android Device Monitor the last couple of weeks monitoring everything. There is literally nothing in either logs that indicate any sort of issue. The app just stops and freezes. It never triggers an Android "This app is not responding, would you like to close it?" dialog either when it happens. If you try tapping a few times on the screen it will eventually trigger the dialog though.

    What I've tried so far:

    • Xamarin versions running from 4.0.4.4 to the latest currently (4.2.0.719).
    • Changing target/compile to Android 6 and fixed any deprecated/obsolete code.
    • Implemented the Android 6 permission scheme thinking it may be a permission issue
    • Verified this issue ONLY occurs on Android 6 (doesn't occur on 4.4 or 5 on various devices). Haven't tried 7 yet.

    One thing to note, our app uses a lot of permissions, and does try to set the system time along with other system changes. We don't have access to do so on some devices. We have a custom Android device we use as well that we have system privileges over so we can do this.

    I am currently in the process of creating a standalone app that mimics some of the things our app does while sitting idle. Mainly running multiprocess with intents getting sent to broadcast receivers to communicate between processes. Hopefully I can recreate it so I can share this app with Xamarin and anyone else interested.

    Being this is an enterprise app, my hands are tied as far as what I can share. I have plenty of logcat logs showing where our app falls on its face and can gladly provide snippets of that!

    My ask is, anyone have any recommendations to try next? Is there any Xamarin advanced logging I could try? At this point it seems it's Xamarin related due to certain versions of Xamarin having or not having the issue. Though maybe something introduced in Android 6 doesn't like what we're doing. Nothing in the Android 6 release notes seem to hint any major changes that'd effect us.

    Sorry for the long post, and appreciate any input! Thanks in advance!

    Thursday, November 17, 2016 2:39 PM

All replies

  • User12817 posted

    Recently one of our Samsung devices updated to Android 6 Marshmallow and started exhibiting some freeze up issues.

    Two things immediately come to mind.

    1. Samsung Devices are notorious for being different than other manufacturer devices when it comes to Android development.

    2. You are migrating a project to a complete major version of Android. (4.4/5.0 -> 6.0)

    Basically when our app is running, after a period of time ranging from 20 minutes to a day, the app will just freeze. The UI will be frozen and our process stops doing work.

    What have you done to identify what's happening before the freeze/jank? There are plenty of ways to go about this, but I would recommend starting here:

    https://developer.android.com/training/testing/performance.html

    https://www.youtube.com/watch?v=HXQhu6qfTVU

    https://developer.android.com/studio/profile/systrace.html

    If this isn't an actual ANR, then we will need logcats to see what's going on to the point of this process being killed in the adb logcat.

    What's interesting is that when building our app using Xamarin 3.9, the lock up issue does not occur running the exact same code, same devices. Keep in mind, Xamarin 3.9 was PRE Android 6 Marshmallow. We've got this to reproduce on multiple Android 6 devices so far.

    Let's clarify some parameters here. For example if we had two .apks:

    A. Built with Xamarin 3.9

    B. Built with Xamarin 4.0+

    Note: We are comparing Xamarin for Visual Studio(XVS) with these version numbers. There is a different Xamarin.Android version that we would want to gather as well. If you can include that with your current 3.9 build and 4.0+ build, that would help us further.

    Our app does a lot of logging, and I've spent a lot of time in the Logcat as well, basically living in the Android Device Monitor the last couple of weeks monitoring everything. There is literally nothing in either logs that indicate any sort of issue. The app just stops and freezes. It never triggers an Android "This app is not responding, would you like to close it?" dialog either when it happens. If you try tapping a few times on the screen it will eventually trigger the dialog though.

    Basically you're saying that your application never goes into an ANR(Application Not Responding) state. However it technically should be in that state if the UI locks up for a long enough time. Does the application get backgrounded for other applications or is it always in the foreground? If so, it could be possible that after a long enough time, the Android OS might kill the application for memory purposes (If struggling for memory). However I'm going to stick with the ANR theory.

    Verified this issue ONLY occurs on Android 6 (doesn't occur on 4.4 or 5 on various devices). Haven't tried 7 yet.

    So if this is true, if you had an .apk built with Xamarin 3.9 here and installed it on a Android 6.0 device(Since you could install the pre-marshmallow app on marshmallow), does this reproduce the issue? If we can confirm that, then it's unlikely that the change of Xamarin.Android between XVS 3.9 -> 4.0 affected this issue, but rather it's the Android version that is causing the problem. You would also want to try testing this against 7.0 if you can, not a big deal but it could potentially be something that was introduced in Android 6.0 and may have been fixed in Android 7.0(Not much to go off yet).

    I have plenty of logcat logs showing where our app falls on its face and can gladly provide snippets of that!

    Please upload as many as you can possibly upload. Please label them accordingly as it might be hard to distinguish the device, API level, etc. I would recommend naming them as: {timestamp}-{device}-{api level}-{success/failure} or something similar.

    Is there any Xamarin advanced logging I could try?

    Sadly no. The best thing you could do is create a logging pipeline via the use of HockeyApp / other third party logging framework. That way you can be notified when crashes happen instead of staring at ADM(Android Device Monitor) waiting for something to fail. However it might not cover this scenario, so having adb logcat running to the point of a crash is always a great thing to have as well.

    Friday, December 2, 2016 11:49 PM
  • User139501 posted

    @JonDouglas

    In the attached zip file I managed to get the lockup to occur during a systrace. Sorry that it's so long, but I had to run a long interval to try catch the lockup.

    At the 82 second mark the "Interactions" line starts a rendering response that never seems to process, eventually building up stuff on the UI Thread that can't process, and eventually everything stops dead at the 518 second part (right where our app freezes). I didn't catch the logcat during this unfortunately. I will try reproduce again and include the system trace and the associated logcat.

    Also, looking at the CPU's, CPU 0 stops processing things at around 80 seconds in, CPU 1 stops around 129 seconds, CPU 2 stops at 166 Seconds. Potentially CPU 3 stopping at 518 seconds.

    I had the systrace run for 1200 seconds, so after 518 seconds in the white space the systrace is still running, just nothing is happening.

    So if this is true, if you had an .apk built with Xamarin 3.9 here and installed it on a Android 6.0 device(Since you could install the pre-marshmallow app on marshmallow), does this reproduce the issue?

    An .apk built using Xamarin 3.9 installed on an Android 6.0 device does NOT exhibit the freeze/lock issue we're seeing. Only once built with 4.0+ will the issue occur. Devices running 4.4 and 5.0 do not exhibit the issue regardless of Xamarin versions.

    Also, the Xamarin 4.0.4.4 version we use primarily uses Xamarin.Android 6.0.4.0. The Xamarin 3.9.274 version that doesn't exhibit the issue has Xamarin.Android 4.20.0.34

    Also, I injected the various doze modes and none of them cause any negative side effects to our application. Our device usually is plugged in and screen on anyways. The lockup/freeze happens with the screen on.

    Let me know if there's anything else you need to know. I'll work on reproducing the systrace combined with the logcat.

    Thanks!

    Monday, December 5, 2016 5:56 PM
  • User12817 posted

    @JustinWojciechowski

    Thanks for this systrace. It gives a little bit of insight as to what's going on. However I do believe an adb logcat accompanying it will provide even more useful. We can obviously see that interactions and other threads are halting around the 518 s mark.

    Just for a small gander, it looks like there are quite a few alerts and frames that are telling you that you have "Jank" in your application. i.e. You are processing too much work on the UI Thread and thus it locks up.

    EX:

    Alert - Scheduling delay Running - 9.620 ms Not scheduled, but runnable - 2,986.777 ms Sleeping - 1,507.334 ms Frame
    Description - Work to produce this frame was descheduled for several milliseconds, contributing to jank. Ensure that code on the UI thread doesn't block on work being done on other threads, and that background threads (doing e.g. network or bitmap loading) are running at android.os.Process#THREAD_PRIORITY_BACKGROUND or lower so they are less likely to interrupt the UI thread. These background threads should show up with a priority number of 130 or higher in the scheduling section under the Kernel process.

    This example if added the sum took 4503.731 ms to complete. That's also known as 4.5 seconds. You can imagine what the operating system might do to an application if it did this often. Let's go over what the OS will do really quick:

    No response to an input event (such as key press or screen touch events) within 5 seconds.

    A BroadcastReceiver hasn't finished executing within 10 seconds.

    These alerts connected to a respective frame are all over your application. EX: 480->518s range. To me this sounds like a potential culprit of the application taking too long and causing an ANR on a more "unforgiving" time. Thus it would sound like this issue would be very "sporadic" and not always happen at the 518s time. But rather it would happen at a time where the Android OS decides to finally kill the application because it went a few more seconds over the the OS limit. For example here is a very unperformant Frame in the 480->518s range:

    Measure/Layout - 13.831 ms - This is fairly normal. However we care about the Running/Sleeping/Not Scheduled, but runnable aspect of this frame which shows:

    Running - 36.411 ms

    Thus this frame took longer than it should've and caused a delay in other Frames(Or was delayed itself). Remember that the boundary set is 16 ms to achieve 60FPS.

    If we watch the following video - https://www.youtube.com/watch?v=we6poP0kw6E we know that we might want to re-visit some of our layouts / animations and flatten them out via Hierarchy Viewer. We can also usetraceview to see potential cpu code missed frames.

    There's also a great talk on this that you can find here: http://blog.udinic.com/2015/09/15/speed-up-your-app

    Although it sounds like there could be a change between Xamarin 3.9 -> 4.0, it sounds like you are doing too much work on the UI thread and really need to move most of that work to the background.

    Note: Is there any reason why you aren't using the latest Stable version of Xamarin? You are on quite an old build since Xamarin.Android 7.0 was released.

    Monday, December 5, 2016 7:11 PM
  • User139501 posted

    @JonDouglas

    Thanks for the quick response. I'll take a peek through those links. The speed up your app blog has some awesome info. I'll work on using some of those tools. Our app is huge and has a lot of developers working on it, so it's hard to keep up with making sure things are done correctly as far as performance goes.

    We typically try to stay on a version of Xamarin that works well for us. Our app requires high stability and up time, and some Xamarin versions have introduced a lot of issues for us, so when we find one that works we stay on it. I am trying to move us up to the latest, but it's a slow and very complex project that requires a LOT of testing and validation.

    I'll keep working on my end with the info you've provided and see if I can do some optimization. It'll take some time though as almost a million lines of code takes a bit to analyze :P

    I appreciate the input though, I'll keep troubleshooting on my end and also try to get a logcat combined with the systrace (and anything else you find valuable).

    Thanks again!

    Monday, December 5, 2016 7:39 PM
  • User12817 posted

    @JustinWojciechowski

    Sounds great. Please let me know your progress and anything else I can help with. I would highly recommend that you attempt to migrate to the latest Stable as some of these issues can linger from over half a year's worth of changes with respect to 4.0.4.4 -> current stable.

    Monday, December 5, 2016 9:12 PM
  • User139501 posted

    @JonDouglas

    Out of curiosity, what would cause CPU's to stop processing any tasks? I've been working with some other engineers analyzing this trace and we're all confused as to why a CPU would stop like shown in the trace.

    I've started doing some optimizations and starting to see improvements already as far as performance goes. That being said, I created an app that pounds the UI thread and does a ton of work. Basically drains all the CPU's constantly. It throws a ton of alerts in system trace as it exceeds the 16ms timer. This app will never lock up nor will it cause any CPU's to stop processing. It seems that if we are running too much on the UI thread, that the system shouldn't just start shutting down CPU's. It looks like we're functioning fine until the first CPU stops. No alerts in the first 80 seconds, one yellow alert for a long draw. Once the first CPU stops it seems to just all go down hill from there. It seems like we get backed up and things start taking longer to process, eventually running on one CPU and it can't keep up.

    I'll keep troubleshooting on my end and trying to find the culprit, just confused as to the CPU shut downs we see in the trace. What is really interesting is this is happening as our app is just sitting idle. It's doing some basic background tasks (should mostly be running in the background on their own threads) and occasionally doing small updates to the GUI. We also have a demo mode which dumbs down everything and doesn't do nearly as much, but still has the lock up.

    My concern is understanding the first CPU shutdown and then working from there.

    Again, thanks for the help! It's much appreciated.

    Wednesday, December 7, 2016 3:15 PM
  • User12817 posted

    @JustinWojciechowski

    To understand that, we need to take a step back and remember that Systrace is a "System Overview". Thus the CPU can be busy with other application's work and it may halt to go work on "more important" at the time, threads.

    I am almost convinced that this Systrace was only for the duration of the application crashing. Thus there's not a great way to see what the other CPUs were doing. I believe the reason why the CPU would display "stopping" for this app is because it crashed and killed the process and then moved on to do work for something else.

    It seems that if we are running too much on the UI thread, that the system shouldn't just start shutting down CPU's.

    Here's the thing about this. If you skip enough frames to the point where the user interacts with the application and it takes the application >= 5 seconds to respond, that's when you get the ANR from the Android OS. Thus nipping these items in the bud are very critical from a performance and usability perspective.

    We actually see the start of your "Jank" around the 118s mark and significantly getting worse each 5-10s. (You can see all of the "Red" Frames). It then escalates to an "uncatchable" rate once it hits 148s and then all these frames are skipped. So something is going on at that time, which I am not sure of exactly other than what Systrace tells me regarding the Scheduling Delay. At this point the best thing to do is to mark ("m") the troublesome frames and dig deep into ensuring they can be optimized for < 16 ms.

    I'm curious as to how much work is really being put on the main thread. Are you scheduling / delegating background work to be put on the main thread when you're ready for UI updates/etc? It seems like there's a large amount of main thread work.

    Wednesday, December 7, 2016 4:16 PM
  • User139501 posted

    @JonDouglas

    I figured I'd throw out what I found in here in case you were curious.

    Turns out there was some code running (a timer firing every 2 seconds) that would lock on a dumby object and within the lock would call a garbage collection. For some reason this code running against the newer Xamarin version along with Android 6+ was causing the lockup. I'd speculate from my readings that the changes made in Xamarin included changes to garbage collection and caused us issues. Definitely was our problem though, this was old code that was ported (rather poorly) and just needed some overhauling. I would guess that the dumby object was getting garbage collected and getting stuck in the lock as it could never "unlock" itself without the object. That or some odd behavior around garbage collecting within a lock.

    It essentially looked like this:

    private object _obj= new object();
    
    private void TimerElapsed(object state)//2 second timer
    {
        lock(_obj)
        {
            //logged entering garbage collection
    
            GC.Collect();
    
            //logged leaving garbage collection (never logged after lockup)
            //code here that didn't run after the lockup
        }
    }
    

    My fix was to just remove the lock and garbage collection as it wasn't necessary. Things are working perfectly now. Was extremely difficult to find as there was no logging around this specific code (I added some later) and it never threw any exceptions or anything during a debug session. Had to essentially turn everything off in our system and enable things slowly until I found the area causing the issue. Then I added logging around this as it seemed odd, and sure enough it would log before the garbage collection and never again after it if the screen locked up.

    Wednesday, December 28, 2016 8:12 PM