Posted by Sergei Glazunov and Mark Brand, Project Zero
Introduction
It was a normal week in the Project Zero office when we got an interesting email from the Chrome team: they’d been looking into a serious crash that was happening occasionally on Android builds of Chrome, but hadn’t made much progress. The same crash had then briefly reproduced on ClusterFuzz, with a test-case which referenced an external website, but it wasn’t reproducing any more, and it seemed likely that the next step was going to be to wait until the bug started reproducing again.
We took a quick look at the details they had, and decided that this issue looked important enough for us to spend some time helping the Chrome team to figure out what was happening. A large part of our motivation here came from the concern that perhaps this external website was intentionally triggering a vulnerable code path (spoiler alert: it wasn’t). The issue also looked to be readily exploitable — the ASAN trace that we had showed an out-of-bounds heap write with what was likely data read from the network.
Although the networking code in Chrome has been split out into a new service process, the implementation of strict sandboxing for that process isn’t complete yet, so it’s still a highly privileged attack surface. This means that this bug alone would be enough both to get initial code execution and to break out of the Chrome sandbox.
We’re writing this blog post mini-series to illustrate the difficulties that even experienced researchers sometimes face when trying to understand a vulnerability in a complex piece of code. This story in particular has a happy ending, and we were able to help the Chrome team to find and fix the issue, but hopefully the reader will also see that persistence perhaps played more of a part here than wizardry.
Chapter 1: The Test-case
So we have the following rather simple test-case:
<script>
window.open("http://example.com");
window.location = "http://example.net";
</script>
Observant readers will no doubt notice that this is a pretty, well, boring output for a fuzzer: all this is doing is loading a couple of webpages! Perhaps this is a good simulation of user behaviour, and so this kind of test-case is a good way to shake out network-stack bugs?
According to the thread, this had now stopped reproducing, so all we had was the ASAN backtrace from when ClusterFuzz first triggered the issue:
=================================================================
==12590==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x8e389bf1 at pc 0xec0defe8 bp 0x90e93960 sp 0x90e93538
WRITE of size 3848 at 0x8e389bf1 thread T598 (NetworkService)
#0 0xec0defe4 in __asan_memcpy
#1 0xa0d1433a in net::SpdyReadQueue::Dequeue(char*, unsigned int) net/spdy/spdy_read_queue.cc:43:5
#2 0xa0d17c24 in net::SpdyHttpStream::DoBufferedReadCallback() net/spdy/spdy_http_stream.cc:637:30
#3 0x9f39be54 in base::internal::CallbackBase::polymorphic_invoke() const base/callback_internal.h:161:25
#4 0x9f39be54 in base::OnceCallback<void ()>::Run() && base/callback.h:97
#5 0x9f39be54 in base::TaskAnnotator::RunTask(char const*, base::PendingTask*) base/task/common/task_annotator.cc:142
...
#17 0xea222ff6 in __start_thread bionic/libc/bionic/clone.cpp:52:16
0x8e389bf1 is located 0 bytes to the right of 1-byte region [0x8e389bf0,0x8e389bf1)
allocated by thread T598 (NetworkService) here:
#0 0xec0ed42c in operator new[](unsigned int)
#1 0xa0d52b78 in net::IOBuffer::IOBuffer(int) net/base/io_buffer.cc:33:11
Thread T598 (NetworkService) created by T0 (oid.apps.chrome) here:
#0 0xec0cb4e0 in pthread_create
#1 0x9bfbbc9a in base::(anonymous namespace)::CreateThread(unsigned int, bool, base::PlatformThread::Delegate*, base::PlatformThreadHandle*, base::ThreadPriority) base/threading/platform_thread_posix.cc:120:13
#2 0x95a07c18 in __cxa_finalize
SUMMARY: AddressSanitizer: heap-buffer-overflow (/system/lib/libclang_rt.asan-arm-android.so+0x93fe4)
Shadow bytes around the buggy address:
0xdae49320: fa fa 04 fa fa fa fd fa fa fa fd fa fa fa fd fa
0xdae49330: fa fa 00 04 fa fa 00 fa fa fa 00 fa fa fa fd fd
0xdae49340: fa fa fd fd fa fa fd fa fa fa fd fd fa fa fd fa
0xdae49350: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
0xdae49360: fa fa fd fd fa fa fd fa fa fa fd fd fa fa fd fd
=>0xdae49370: fa fa fd fd fa fa fd fd fa fa fd fa fa fa[01]fa
0xdae49380: fa fa fd fa fa fa fd fa fa fa fd fd fa fa fd fd
0xdae49390: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
0xdae493a0: fa fa fd fd fa fa fd fa fa fa 00 fa fa fa 04 fa
0xdae493b0: fa fa 00 fa fa fa 04 fa fa fa 00 00 fa fa 00 fa
0xdae493c0: fa fa 00 fa fa fa 00 fa fa fa 00 fa fa fa 00 fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Shadow gap: cc
==12590==ABORTING
This looks like a really serious bug; it’s a heap-buffer-overflow writing data that likely comes directly from the network. We didn’t, however, have a development environment for Chrome on Android available, so we decided to try and find the root cause ourselves, figuring that it couldn’t be too hard to find a place where the size of an IOBuffer gets confused.
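For readers less familiar with the shadow-byte dump above, the following is a minimal sketch of how ASan interprets a single shadow byte for a small access. It follows the publicly documented ASan algorithm, but the function name AccessIsInvalid is ours, and it assumes an access of at most 8 bytes that does not cross an 8-byte granule. The real runtime handles large accesses such as this memcpy through separate range checks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simplified model of one ASan shadow byte, which describes an 8-byte aligned
// granule of application memory:
//   0        -> all 8 bytes are addressable
//   1..7     -> only the first k bytes are addressable
//   negative -> the whole granule is poisoned (redzone, freed memory, ...)
// Returns true if an access of `size` bytes starting at `addr` would be
// reported. Assumes 1 <= size <= 8 and that the access stays inside the
// granule (an assumption of this sketch).
bool AccessIsInvalid(std::int8_t shadow, std::uintptr_t addr, std::size_t size) {
  assert(size >= 1 && size <= 8);
  if (shadow == 0) return false;  // fully addressable granule
  // Offset (within the granule) of the last byte touched by the access.
  int last_byte = static_cast<int>(addr & 7) + static_cast<int>(size) - 1;
  // A poisoned granule has a negative shadow value, so this is always true
  // for redzones; for 1..7 it is true once the access passes byte k - 1.
  return last_byte >= shadow;
}
```

With the values from the trace, the shadow byte [01] says that only the byte at 0x8e389bf0 is addressable, so even a one-byte write at 0x8e389bf1 would already be flagged.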
Chapter 2: HttpCache::Transaction
Since the bug was no longer reproducing on ClusterFuzz, we initially assumed that something had changed in the webserver or network configuration, and started looking into the code.
Tracking back where the IOBuffer being written to in SpdyHttpStream::DoBufferedReadCallback comes from, we’re likely looking for a call site of HttpNetworkTransaction::Read where the size of the IOBuffer passed as the parameter buf doesn’t match the length passed as buf_len. There aren’t all that many call sites, but none of them looked immediately wrong, and we spent a few days going back and forth with hopeless theories.
Perhaps our first mistake was to then start trying to collect more information on the bug by investigating in Chrome’s crash dump repository. It turns out that there are quite a few crashes with similar stack-traces that we simply couldn’t explain, and they didn’t help to explain this bug! After perhaps a week of investigation we had a huge collection of related crashes and had made no progress at all towards finding the root cause of this issue.
We were getting increasingly close to giving up in frustration when we found the following line of code in HttpCache::Transaction::WriteResponseInfoToEntry (which we must both have skimmed past multiple times by this point… and for some reason not noticed):
// When writing headers, we normally only write the non-transient headers.
bool skip_transient_headers = true;
scoped_refptr<PickledIOBuffer> data(new PickledIOBuffer());
response.Persist(data->pickle(), skip_transient_headers, truncated);
data->Done();
io_buf_len_ = data->pickle()->size();
// Summarize some info on cacheability in memory. Don’t do it if doomed
// since then |entry_| isn’t definitive for |cache_key_|.
if (!entry_->doomed) {
  cache_->GetCurrentBackend()->SetEntryInMemoryData(
      cache_key_, ComputeUnusablePerCachingHeaders()
                      ? HINT_UNUSABLE_PER_CACHING_HEADERS
                      : 0);
}
This looks highly suspicious: in other locations in the same file, there is a clear expectation that io_buf_len_ matches the size of the IOBuffer read_buf_, and indeed this assumption is used in a call that would lead to a Read call:
int HttpCache::Transaction::DoNetworkReadCacheWrite() {
  TRACE_EVENT0("io", "HttpCacheTransaction::DoNetworkReadCacheWrite");
  DCHECK(InWriters());
  TransitionToState(STATE_NETWORK_READ_CACHE_WRITE_COMPLETE);
  return entry_->writers->Read(read_buf_, io_buf_len_, io_callback_, this);
}
It certainly matched everything we knew about the bug, and at this point it seemed by far the best lead we had, but it’s not trivial to reach this code in an interesting way. The HTTP cache implements a state machine with ~50 different states! This state machine is usually run twice during a request: once when the request is started (HttpCache::Transaction::Start) and again when reading the response data (HttpCache::Transaction::Read). In order to reach this bug, we’d need a loop in the state transitions that could take us from one of the Read states back into a state that can call WriteResponseInfoToEntry, and then back into reading the data without updating the read_buf_ pointer. So we’re focussed on the second run of this state machine, that is, the states that are reachable from the Read call.
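To make the suspected pattern concrete, here is a toy model of it. This is only a sketch under our assumptions, not Chrome code: ToyTransaction, StartRead, WriteHeaders and NextReadWouldOverflow are hypothetical names, and a std::vector stands in for the IOBuffer. It shows how a length member that is reused for a second purpose can get out of sync with a buffer pointer that was recorded earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a transaction that stores a caller-supplied
// buffer and a separately tracked length, in the way read_buf_ and
// io_buf_len_ are stored in the snippets above.
class ToyTransaction {
 public:
  // Models a Read() call: buffer and length are recorded together and agree.
  void StartRead(std::vector<char>* buf) {
    assert(buf != nullptr);
    read_buf_ = buf;
    io_buf_len_ = buf->size();
  }

  // Models the suspicious assignment: the length member is overwritten with
  // the size of an unrelated buffer, while read_buf_ is left untouched.
  void WriteHeaders(std::size_t pickled_size) { io_buf_len_ = pickled_size; }

  // True if reading io_buf_len_ bytes into read_buf_ would run past its end.
  bool NextReadWouldOverflow() const {
    return read_buf_ != nullptr && io_buf_len_ > read_buf_->size();
  }

 private:
  std::vector<char>* read_buf_ = nullptr;
  std::size_t io_buf_len_ = 0;
};
```

In this model, calling StartRead with a 1-byte buffer and then WriteHeaders(3848) leaves the object in a state where the next read would overflow. The sizes are borrowed from the ASan report purely for illustration; whether the real state machine can reach such a state is exactly the open question at this point.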
WriteResponseInfoToEntry has 4 call sites, all in state handlers:
DoCacheWriteUpdatedPrefetchResponse
DoCacheUpdateStaleWhileRevalidateTimeout
DoCacheWriteUpdatedResponse
DoCacheWriteResponse
We need to first establish whether there’s a transition path from HttpCache::Transaction::Read into each of these states, since otherwise we won’t have a previous value for read_buf_ and io_buf_len_.
Since it’s a bit difficult to reason about the state machine transitions by looking at the code, we’ve prepared a graph approximating the state machine, which will make it clear and simple for the reader to understand.
It would have been sensible to do this initially; but we originally just manually performed a depth-first-search of the state machine in the source code, which was error-prone and confusing.

The four states marked in yellow are the states which could alter the value of io_buf_len_; the states which would then use this corrupted io_buf_len_ value are the three child states of TransitionToReadingState: CACHE_READ_DATA, NETWORK_READ, and NETWORK_READ_CACHE_WRITE, which are marked in green.
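The manual depth-first search mentioned above is easy to mechanise once the transitions are written down as data. The sketch below shows the idea on a deliberately tiny graph. The edges in ToyGraph and the state labels are illustrative assumptions only and are not the real HttpCache::Transaction transitions; Reachable is a helper name of our own.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Adjacency list: state name -> states it can transition to.
using Graph = std::map<std::string, std::vector<std::string>>;

// Iterative depth-first search: returns every state reachable from `start`,
// including `start` itself.
std::set<std::string> Reachable(const Graph& g, const std::string& start) {
  std::set<std::string> seen;
  std::vector<std::string> stack{start};
  while (!stack.empty()) {
    std::string s = stack.back();
    stack.pop_back();
    if (!seen.insert(s).second) continue;  // already visited
    auto it = g.find(s);
    if (it == g.end()) continue;  // no outgoing edges recorded
    for (const auto& next : it->second) stack.push_back(next);
  }
  return seen;
}

// Illustrative edges only; NOT the real state machine.
Graph ToyGraph() {
  return {
      {"READ", {"NETWORK_READ_CACHE_WRITE", "CACHE_READ_DATA"}},
      {"NETWORK_READ_CACHE_WRITE", {"NETWORK_READ_CACHE_WRITE_COMPLETE"}},
      {"NETWORK_READ_CACHE_WRITE_COMPLETE", {"CACHE_WRITE_RESPONSE"}},
      {"CACHE_WRITE_RESPONSE", {"NETWORK_READ_CACHE_WRITE"}},
      {"START", {"CACHE_WRITE_UPDATED_RESPONSE"}},
  };
}
```

Calling Reachable(ToyGraph(), "READ") and checking whether each state that writes io_buf_len_ is in the result answers the reachability question for that start state. In this toy graph the loop through CACHE_WRITE_RESPONSE is reachable from READ, while CACHE_WRITE_UPDATED_RESPONSE is not.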