OneDrive + Deming + Quality

Emails sanitized and posted with permission.


From: Erik Gavriluk
Date: Sat, Jun 20, 2015 at 12:16 PM
Subject: OneDrive
To: Joe Belfiore

Heads up on OneDrive — this thing is really not good. Obviously one person’s anecdata isn’t a useful survey but I’ve seen numerous behaviors that reveal a fundamental lack of engineering process. It fails in common scenarios and breaks spectacularly in slightly-uncommon ones like copying from a Unix-based file system.

I’m sure you’re swamped but this is the kind of glitch that could derail what otherwise appears to be an amazing upcoming release of Windows.

I don’t cry wolf and I feel obligated to sound the alarm. Everyone I know is having problems with this damn thing. It’s really bad.

Is there someone I can talk to?


From: Joe Belfiore
Date: Mon, Jun 22, 2015 at 2:31 AM
To: Erik Gavriluk
Cc: Dick

Thanks for sending.

Dick on the CC line runs program management for OneDrive, and he’s interested in hearing more specifics. Getting sync reliability (healthy clients, effective service-side scaling, etc. etc.) has been a big priority for the team and they are rolling out a bunch of work in that dimension.  Your timing to ask is good.


From: Erik Gavriluk
Date: Fri, Jun 26, 2015 at 1:47 AM
To: Dick, Joe Belfiore

Hi Dick,

I spent the past few days contemplating a response. I would like to commend your team for reaching out to me and particularly their positive attitude in doing so. It says a lot. I pruned my response back to just you and Joe due to tone, which I apologize for in advance.

Sync is an odd blind spot for Microsoft. I swear I never had a Microsoft product sync correctly. Calendars in Outlook 97, files on the PocketPC, contacts in Windows Phone 5 and 6 (and 6.1, and 6.5…), songs on Zune, saves on Xbox, my tax returns on Windows Home Server, entire databases via Azure SQL Sync.

I can barely list all the platforms I’ve lost data on let alone the numerous times I’ve lost data on each platform. FTP, WebDAV, LDAP, IMAP — and hey, remember that Live Mesh disaster? — if there’s a way to copy data over it, Microsoft’s found a way to mess mine up.

But hey, it’s gotta be me, right?

Dropbox. Mozy. Copy. Time Machine. All coded in half-assed languages by second-tier software companies. Never had a problem with any of ’em! The only time I see a popup is when they want me to give them money. Which I do, happily, because their stuff just works.

Windows Backup? Just popped a notification while I was typing this saying it couldn’t find the server. Meanwhile Mozy’s changed corporate ownership twice since I’ve been a subscriber yet it just keeps working.

Just glanced down at the notification bar and now I have two OneDrive notification icons for some reason. (Hovered over one and it silently disappeared.)

I simply do not understand.

Sync is hard! I certainly don’t have to tell you that. But it’s containable! Unless…

Prior to Windows 8.1, we had two sync experiences. One used on Windows 7/8/Mac to connect to the consumer service, and a second sync engine to connect to the commercial service (OneDrive for Business). In Windows 8.1 we introduced a third sync engine…

https://www.microsoft.com/en-us/microsoft-365/blog/2015/01/07/taking-the-next-step-in-sync-for-onedrive/

Oh, good lord.

I don’t know what’s possible to fix ahead of the Windows 10 launch but something has to be done with status and error reporting at a minimum. This is critical. I’ve hand-soldered prototype SCSI interfaces yet I cannot build a mental model of what OneDrive thinks it’s doing while copying data. I saw OLE-style error codes! They had cobwebs on them! (Tip to senior managers: if you see anything like 0x8000a0a3 in end-user facing dialogs that means “Too many technology layers. Please simplify.”)

As best as I understand it — admittedly I’m on Windows 7 and I’m a consumer so I only have a subset of “sync experiences” — OneDrive seems to consist of the following:

  • A slow, buggy file scanning component that uses excessive RAM
  • A headless notification component that seems to miss local file system notifications yet overreact wildly to remote network notifications
  • A glitchy bandwidth throttler that varies between “do nothing at all” and “send so much data it crashes my Airport Extreme” — requiring a power cycle
  • A user interface component straight out of the Windows 3.0 design era that reports strange errors but gives you no help in solving them. (And, after you manually fix the errors, continues to display them.)
  • A system tray widget that only knows how to say “Scanning for changes…” and launches the old “skydrive.live.com” URL when I click on it
  • Plus a web-based client and some backend indexing tech that is truly brilliant and best of breed which is why I’m even bothering to go through this exercise

Dropbox is presently using 125MB of RAM. This is excessive and wrong and always bothered me. OneDrive is presently using 860MB and has been burning 20% or more CPU since mid May. 

[ Update: my sync completed! Now OneDrive is at 25% CPU utilization and using nearly two gigs of RAM. ]

The bugs I’m seeing indicate fundamental design flaws and sloppy engineering. One of your developers mentioned “adding more telemetry.” It’s summer 2015! It should’ve had complete telemetry half a decade ago before it even copied the first byte.

I realize this is not your fault. And yes, we both know that this technology stack has been a career-ender for nearly every person who comes near it. Sync is just something that, for whatever reason, seems incompatible with Microsoft’s method of specifying and building technology. (I have pages there, too, but now’s not the time.)

To wrap this up as briefly as I can: I’ve seen bizarre, bug-report-defying sync behavior on Windows 7, Windows 8 and Mac. I’ve seen it on OneDrive Business and OneDrive: The Consumer Experience (now in 3D IMAX!) I’ve seen it in secure corporate environments with locked down hardware and networks (so I know it’s not malware or my config).

Obviously I’m an outsider and I don’t know what the plan is around Windows 10 but I don’t see any value in risking it on a OneDrive Hail Mary. Just ratchet it all back. 

Clean up status indicators, bandage up the error reporting, and pull everything back to conservative defaults. Copying speeds, caching, retries, RAM utilization, hell — even allowable storage space. Just throttle it all way back. Pull back the marketing. Knock it down two slots in the Save dialogs. Slow down the rollout of “unlimited” storage. Seriously… who greenlit that?

There are no cloud storage wars. Nobody gives a damn about Google Drive. Box? Dead, and Sinofsky’s screwing around over there, so you know it’s double dead. Dropbox is not a real company and we all know it. So let’s not screw up the Windows and Office franchises for the sake of some pretend space race.


From: Erik Gavriluk
Date: Fri, Jun 26, 2015 at 1:33 PM
To: Dick, Adam, Steven, Joe Belfiore

Ha, yes, Exchange was a bright spot! I still couldn’t get my Gmail copied over to Outlook.com though, and maybe that’s more important in 2015. Happy to provide more actionable details. Here’s some of the nuts-and-bolts stuff I pruned from my previous message. Please ignore the tone. Adam and Steve, ask away!

I have two Dropbox accounts, both contain just under 1 TB of data. OneDrive’s shared folder functionality limitation is embarrassing (how did that happen?!) but that’s clearly getting fixed so I figured I’d start migrating.

1.8TB of data on a 4TB drive. Copy everything from D:\Dropbox to D:\OneDrive. How hard can it be?

It took me over seven hours to fix all the “path too long” and “invalid characters in filename” problems. And I’m sure I screwed up some project files in the process. (There was even “robocopy” involved — that thing is maximally stupid, too.) I had to boot a Linux shell to speed up the process. This is madness.

The user interface couldn’t have been worse. It would show a few errors and give me no way to launch the problem folders. Then I’d manually fix and it would start scanning again… sometimes. Minimum of 30 minutes before the next batch of errors appear. And repeat.

Ergonomics aside the technical limitation is simply unacceptable. We got long filenames working on DOS… surely you guys can figure this out. 

My files are on the disk! Twice! They’re already synced in Dropbox! You simply have to superset the Dropbox and filesystem functionality. There is no other option. There is no debate. “Restricted characters?” No! The files are already on my disk! How can you ship a cloud service in 2015 that has less functionality than a DOS filesystem?

Anyway, that battle complete, I looked at my OneDrive tray icon, it said “Up to date.” Dragged the files in, and… “Up to date.” Waited a half hour. “Up to date.” Rebooted, went to bed, the next morning it was “Scanning for changes” again.

That was Wednesday June 17. The PC is on 24/7. I don’t use it very much. It’s on a gigabit switch. Speedtest.net just informed me of a 77Mbps upload rate and 11ms ping time. As of Friday at 8pm it had not yet so much as created the destination folder on the server.

On Saturday it finally started copying files. I pulled up onedrive.live.com on my Mac and saw the folder, finally. Saw a picture I liked, accidentally clicked “delete” instead of “download” in the web interface, then clicked “undo.”

The PC immediately went from copying to “Scanning for changes…” and sat there doing nothing for 24 hours.

145,000 files. 980 GB. “Properties” on the Dropbox and/or OneDrive folder gives me these stats in under a minute. 

What the hell is it doing? Is it hashing? Thumbnailing? Text indexing? Chatting with the server? Do I have too many files? Does it not like some of my media files? 

Couldn’t tell you, because there’s no status. No progress bar. No “Estimated completion: July 7th.”

It’s all so primitive and weird and half-assed.

[ Update: once it started, it copied 995.39GB in under five days. I consider this spectacular. ]

Prior to May, my experience with OneDrive was Mac only. The thing wasn’t even close to working on MacOS until the most recent updates. I’m sure you know this; I’ll spare you the details. But it was… bad. Really bad. Unprofessional, brand-killing bad. Even had an icon that drew incorrectly on the title bar. Multiple updates, but still with the wonky icon. That stuff is important. 

Ever watch someone close a big sale with bad product but really white teeth? If nothing else, at least get the teeth right. And hey, just snapped a grab of the dual notification icons on Windows:


From: Erik Gavriluk
Date: Sat, Jun 27, 2015 at 2:15 AM
To: Adam, Dick, Steven, Joe Belfiore

Now y’all are just pranking me.


From: Erik Gavriluk
Date: Sat, Jun 27, 2015 at 2:30 PM
To: Adam, Dick, Steven, Joe Belfiore

Thanks for the awesome reply, Adam. I’ll send logs and tech notes to you directly.

I would like to make a gentle caution about prioritizing program management and engineering around things like “0.14% of our active users hit this error…” though.

There’s a guy named W. Edwards Deming and everyone on this email should read him. If OneDrive were rock stable then you could sort data like that. But in OneDrive’s current state you need to focus on WHY 0.14% of the people are having that problem (hint: the answer isn’t buffer sizes) and address the problem holistically. This actually explains why it’s the #3 biggest feedback issue given the tiny failure rate. The natural inclination is to ignore that fuzzy number and focus on the hard data. But even a quick skim of Deming’s Wikipedia page will convince you that’s not the way to go.

I’ll beat this drum for one more paragraph because it’s important: I can’t tell you how much hard data there was around the first Xbox controller. Everybody internally knew it was awful, but one guy had all the data, and, well, now they teach it as an example of what not to do in business classes. Sinofsky endorsed an insane 6,000 page blog post talking about what people do in Windows Explorer. In the hard data “Open a File” didn’t even make the top 10. Four years later he’s still typing long posts — but now on LinkedIn trying to convince people he didn’t single-handedly bring an entire industry to a screeching halt. The resulting interface was such a disaster that Dell now points their screens away from consumers, and prefers to show empty wallpaper rather than even hint that their machines run Windows. Why? Because people still can’t figure out how to open their shit.

Don’t be a Sinofsky.

And don’t let my nonsense bring you down! The browser portion of OneDrive is already amazing. The transfer and storage component makes the competition look like toys. There are 50+ companies working on some idiotic subset of this in music, photos, legal documents — basically every MIME type gets its own startup that pretends to solve this problem. Once you guys get it stable you’re not going to believe what people do with it as a platform.

You will win and only Microsoft knows how to win that side of the game.


From: Adam
Date: Sun, Jun 28, 2015 at 1:37 PM
To: Erik Gavriluk
Cc: Dick, Steven, Joe Belfiore

Thank you Erik for turning me on to W. Edwards Deming – I was blown away by even a quick read of Wikipedia by how applicable his ideas are to the problem of sync, where the instantaneous state of the quality of the system is inherently unknowable.  Made me think of some mistakes we have made in the past – trying to get perfect and truly comprehensive data on one slice of the problem and miss entire sections of the experience.  Result is we don’t really measure quality correctly.  Anyway there is lots here and I am hungry to learn more.  Wish I knew of him earlier.  Just ordered his 1993 book.