Thursday, December 08, 2005

A virtual drive in every pot

I've been mulling over the idea of a virtual drive on the Internet for some years now. Witness my ham-handed efforts at http://zfs.sourceforge.net for an early example. Well, I recently resurrected the idea of writing something like it, and it's interesting to see how my thoughts have evolved.

ZFS as I initially envisioned it was to be a network of automatically replicating file servers and the use case in my mind was a university file server. There would be a mapping of many users to a single (virtual) server, with the system (internally a cluster) having to scale to handle as close to an infinite number of users as possible.

Lately, I've been thinking more along the lines of writing a 'Net Drive' type application. A virtual disk I can mount from any machine connected to the Internet and treat as a local drive. Companies like xDrive and iDrive and MangoSoft already offer something of the sort. However, their offerings are targeted more towards the business user. I personally feel there's a massive untapped market of casual users who might be interested.

Imagine having 1GB of space available to you online and directly accessible via a virtual drive. Save documents, media files etc. directly to the virtual drive and access them from anywhere, either through another computer with the same drive mounted or through a web interface. Share your password with several people and have them save to the same drive if you wish, give them a URL to the data on your drive, or just share certain folders. Boom, you've just eliminated the need for a hard disk on your PC. Internet appliances, here we come!

This type of application would be ideal for someone like Google to create and I can see them stepping into this field sometime soon. It's a classic Google app. You need to scale almost infinitely, but that's easy because you can create slices of the virtual resource and limit the number of users accessing each slice. Want to support more users? Add more slices.

Take Gmail as an example. It's probably got hundreds of millions of users, but unlike the university use case, the users have a many-to-many (or from another perspective, a one-to-one) relationship with the system. That is, unlike university students, Gmail users are not interested in checking other people's mail, accessing a common email account, or even sharing their email account. This makes it much easier to slice up the virtual space, assigning a limited number of users to each slice and scaling the slices. So Gmail is probably made up of thousands of individual computers, each supporting let's say 1000 users, fronted by an authentication cluster. When a user wants to log in, he goes to gmail.com, is authenticated and is then redirected to the individual machine he shares with 999 other people. If you want to add another 1000 users, plug in another machine. You can keep scaling horizontally till infinity for all practical purposes. The authentication datastore will eventually become a bottle-neck, but you can support an enormous number of users before you hit that wall. *
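To make the slicing idea concrete, here is a minimal sketch of a front-end directory that assigns each new user to a slice and adds a slice when the current one fills up. Everything in it is an assumption for illustration: the `SliceDirectory` class, its method names and the 1000-user cap are my own guesses based on the numbers above, not how Gmail actually works.

```python
from dataclasses import dataclass, field

USERS_PER_SLICE = 1000  # the illustrative figure from the post, not a real Gmail number


@dataclass
class SliceDirectory:
    """Hypothetical front-end directory mapping each user to one slice (machine)."""

    capacity: int = USERS_PER_SLICE
    slices: list = field(default_factory=list)         # slices[i] = users on machine i
    user_to_slice: dict = field(default_factory=dict)  # username -> slice index

    def sign_up(self, user: str) -> int:
        """Assign a new user to the newest slice, adding a slice when it is full."""
        if user in self.user_to_slice:
            return self.user_to_slice[user]
        if not self.slices or len(self.slices[-1]) >= self.capacity:
            self.slices.append([])  # "plug in another machine"
        self.slices[-1].append(user)
        slice_id = len(self.slices) - 1
        self.user_to_slice[user] = slice_id
        return slice_id

    def route(self, user: str) -> int:
        """After authentication, look up which slice to redirect the user to."""
        return self.user_to_slice[user]
```

Note that `user_to_slice` is the one piece of shared state here, which is exactly why the authentication datastore is the part that eventually feels the load.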

If Google were making a virtual drive, they'd do something similar. As new users signed up, they'd be assigned to different machines, up to a certain max cap. Just like Gmail, users have a one-to-one relationship with their account. That is, they're only interested in the contents of their own accounts and have no need to access anyone else's account or a common store. This makes it trivial to scale out: capacity grows with every machine you add.

Now this is a great product to make and market, except for one small problem; there are already a whole bunch of people out there doing the same thing. So we need to differentiate ourselves from the pack.

One way to do that is to add tagging support to the virtual drive. You can tag files and folders and view virtual 'tag' folders with links to those files. Mainstream OSes don't have a tagging mechanism for files, so we'll have to add meta-data through file names. For example, end a file name with a special character followed by the tags (e.g. myfile.txt#work,proposal,text); the tags will be stripped off before the file is saved to the virtual drive. Users can also publicly 'share' tags.
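The file-name convention is simple enough to sketch. The snippet below is only an illustration: the function names `split_tags` and `build_tag_index` are made up, and I'm assuming '#' as the special character and commas between tags, as in the example above. It splits on the last '#' since the tags go at the end of the name.

```python
from collections import defaultdict


def split_tags(name, marker="#"):
    """Split 'myfile.txt#work,proposal' into the real file name and its tags."""
    base, sep, tag_part = name.rpartition(marker)
    if not sep:  # no marker in the name, so there are no tags
        return name, []
    tags = [t.strip() for t in tag_part.split(",") if t.strip()]
    return base, tags


def build_tag_index(names):
    """Build the virtual 'tag' folders: tag -> list of file names carrying it."""
    index = defaultdict(list)
    for name in names:
        base, tags = split_tags(name)
        for tag in tags:
            index[tag].append(base)
    return dict(index)


base, tags = split_tags("myfile.txt#work,proposal,text")
print(base, tags)  # myfile.txt ['work', 'proposal', 'text']
```

One obvious wrinkle with this scheme is a file whose real name already contains the marker character, so the client would probably need an escape rule.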

Other features we can offer are:

  1. Fast file indexing and searching and maybe even mapping/linking files to each other based on content etc.
  2. Clients for hand-helds with disconnected operational ability
  3. Single-click integration with Flickr, Del.icio.us etc.
  4. Rsync-based transfers

How are you going to pay for all this? Advertising. Show text/banner ads in the virtual drive folder and on the website as well. Offer premium accounts and dedicated machines for business users.

Who knows, I might work on this idea... or maybe not.

* Correction: It's possible to avoid turning the authentication store into a bottle-neck. One way to do this would be to have the store for a particular set of users reside on the machine assigned to them. So when you want to access the virtual drive abc.virtualdrive.com, you go to that URL and send in your username and password. If the authentication process running on that machine can't find the user, the login attempt fails.

This just leaves the DNS server as the bottleneck now :-)
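A rough sketch of what the per-machine login from the correction might look like. The `SliceAuthStore` class and its methods are hypothetical, and the salted PBKDF2 hashing is my own assumption about how passwords would be stored; the post itself only says that each machine keeps the store for its own users.

```python
import hashlib
import hmac
import os


class SliceAuthStore:
    """Hypothetical credential store living on one machine, e.g. abc.virtualdrive.com."""

    def __init__(self, hostname):
        self.hostname = hostname
        self._users = {}  # username -> (salt, password hash); only this machine's users

    @staticmethod
    def _hash(password, salt):
        # Salted PBKDF2 so the store never holds plain-text passwords (an assumption).
        return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)

    def add_user(self, username, password):
        salt = os.urandom(16)
        self._users[username] = (salt, self._hash(password, salt))

    def login(self, username, password):
        """Fail if the user is not on this machine or the password does not match."""
        record = self._users.get(username)
        if record is None:
            return False  # unknown here: the login attempt simply fails
        salt, expected = record
        return hmac.compare_digest(expected, self._hash(password, salt))


store = SliceAuthStore("abc.virtualdrive.com")
store.add_user("alice", "s3cret")
print(store.login("alice", "s3cret"))  # True
print(store.login("bob", "anything"))  # False
```

Since no machine ever needs to ask a central store about a user, the login load spreads out along with the storage load.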

1 comment:

Anonymous said...

You said:
"The authentication datastore will eventually become a bottle-neck"

I disagree. This would only be true if access to the authentication datastore is bottleneck-shaped. The datastore just needs to be designed as carefully as the rest of the file system.

For instance, the authentication datastore needs to be hashed across multiple machines. Along with this, there needs to be a set of frontend servers (horizontally extendable of course), a server-based algorithm that determines which server to contact for authentication, and a client-side algorithm to determine which server to contact to determine the authenticating server. Should be fairly trivial to do this.
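As a minimal sketch of the hashing step, assuming a fixed list of authentication servers (the host names and the `auth_server_for` function below are invented for illustration), a client or frontend could pick the server like this:

```python
import hashlib

# Hypothetical authentication servers; the names are placeholders.
AUTH_SERVERS = ["auth0.example.net", "auth1.example.net", "auth2.example.net"]


def auth_server_for(username, servers=AUTH_SERVERS):
    """Pick the server holding this user's credentials by hashing the user name."""
    digest = hashlib.sha256(username.lower().encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


print(auth_server_for("alice"))
```

Plain modulo hashing like this remaps most users whenever the server list changes; consistent hashing is the usual way to limit that reshuffling.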