tl;dr: A very long C++ function with multiple local variables, even if they are not very big and they are placed in separate scopes, can reserve as much as hundreds of kilobytes of stack frame, causing "Stack Overflow" even without bugs like infinite recursion. So you better split your long functions into shorter ones.
Can refactoring (or the lack of thereof) cause application crashes? If we understand refactoring as changes in code layout without changing its logic, we might think that it's just the matter of readability and unreadable code increases chances of introducing bugs. But here is a story in which refactoring actually fixed a bug.
Long time ago in a software project far far away, there was a bug submitted telling that the application crashes with "Stack Overflow" message. It was a Windows app, developed in C++ using Visual Studio. I thought: - I can handle that, it should be easy! Every beginner/intermediate programmer knows about the call stack and surely seen this error at least once when accidentally caused infinite recursion in his code. So my first idea was that infinite recursion happens because of some logical error in the code (that should be easy to fix) or some unfortunate, invalid input data (that should be validated for safety before usage).
As it turned out, this was not the case. After setting up all the test environment and catching the crash in Visual Studio debugger, I looked at Call Stack and noticed that it looks quite normal. Sure the call depth was significant (as for C++, I'm not talking about Java here ;) and there was even some recursion, but 20 or 30 functions is not that much. The stack ended with a call to non-recursive function that seemed correct, so it was not the recursion that caused stack overflow.
My second idea was that some of these functions allocate some big objects (like arrays) by value, as local variables on the stack and this causes the stack to grow too big. I reviewed code of the functions that I found on the stack and used "Immediate Window" panel to quickly check sizeof(xxx) of variables or their types when they used some class, but I didn't find anything particularly big. Local variable sizes varied from few bytes to at most several hundred bytes and I couldn't find any big arrays defined in these functions. I also fetched address of some local variable in a function near the bottom of the stack (which looks like 0x000000000009a370), address of a parameter from the function at the top of the stack and subtracted them to see how big the stack grown over all these calls. The result was around 50 KB - not that much.
My third idea was to check maximum size of the stack. It is 1 MB by default, but it can be changed in Visual Studio project settings, in Linker > System tab, as "Stack Reserve Size" parameter. I check my project and I found this parameter not changed from its default value.
OK, now this became more difficult than I thought. After many debugging sessions, where I looked at various pointers, addresses and numbers trying to spot some memory override, stack corruption, out-of-bounds indexing etc., I finally opened "Disassembly" and "Registers" panels. I'm not a fan of such low level stuff, so it took me some time and few Google queries to understand these RSP, RBP registers and make sense of some x86-64 opcodes. While debugging step-by-step in the assembly, I found something interesting. At the beginning of my function, there was a call to mysterious function
__chkstk and the crash occurred inside it. That was a clue I could use to ask Google what this all means. I found this: Description of the stack checking for Windows NT-based applications and this: What is the purpose of the _chkstk() function? These articles say that as the stack grows, next 4 KB pages are reserved. Each next page is allocated by the system on first "touch". I could actually see in my debugger that functions which need less than 1 page (4096 B = 1000h) have an instruction at the beginning similar to this:
While my debugged function had this instead:
call __chkstk (018104AA00h)
The articles say that when reserving more than one page of stack memory, this function must be called to loop over addresses with 4 KB step and "touch" each page. This is really what it does:
--- f:\dd\vctools\crt\crtw32\startup\amd64\chkstk.asm ---
mov qword ptr [rsp],r10
mov qword ptr [rsp+8],r11
mov r11,qword ptr gs:[10h]
jae cs10+10h (018104AA40h)
mov byte ptr [r11],0
jne cs10 (018104AA30h)
mov r10,qword ptr [rsp]
mov r11,qword ptr [rsp+8]
Key sentence of the second linked article seems to be: "The parameter in rax is size of data you want to add." In my case, eax is set to 26B29h = 158505. Wait, what?! This is more than 150 KB! Is it really how much of the stack the function needs?!
It was finally the right conclusion. The function was more than 3000-lines long, with lots of nested conditions and all kinds of stuff, but mostly an all-encompassing switch with dozens of different cases. I refactored it, extracting code from under each case to a separate function. This fixed the "Stack Overflow" crash.
Apparently if you have a long function and define a lot of local variables, even if they are not particularly big and they are placed inside separate scopes like if-s or switch case-s, the function may need as much as 150 KB of stack frame, at least in Debug configuration. This can cause crash with "Stack Overflow" message even without infinite recursion or bugs like that. So please keep this in mind as additional argument for refactoring your code as soon as you see the need for it.