
An Introduction to Object Storage

What is object storage? Object storage is a term used to describe a new storage paradigm. It was created to address the challenges we're starting to see around the massive growth in unstructured data. Today's traditional storage technologies aren't able to scale to deal with this growth, which some analysts say will be 40% over the next 10 years. But before we go any deeper into object storage, let's get a refresher on file- and block-based storage architectures. It's important to understand these technologies first and how object storage is different.

SAN Storage – This technology has been around for decades and has been the foundation for both direct-attached storage (DAS) and storage area network (SAN) technologies. In block storage, blocks are numbered and stored in a table, and the OS references the table to access the appropriate block(s). In Windows this table is FAT or NTFS, and in the UNIX world it's called the superblock. This model is limited to the OS or kernel level, and the challenge is scalability from the server and file system perspective. Storage arrays addressed some of those challenges by centralizing storage and allowing for more growth, but we still had operating system and file system constraints.
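
To make the table idea concrete, here's a toy sketch in Python. Nothing here matches any real file system's on-disk format; it just shows a name being resolved to numbered blocks:

```python
# Toy illustration of a block-allocation table: a file name maps to the
# numbered blocks that hold its data. Real file systems (FAT, NTFS, UNIX
# superblock/inode structures) are far more elaborate; this is only a sketch.

BLOCK_SIZE = 4096  # bytes per block, a common default

# The "device": a flat list of fixed-size blocks.
disk = [bytearray(BLOCK_SIZE) for _ in range(16)]

# The "table": file name -> ordered list of block numbers.
allocation_table = {"report.txt": [2, 5, 9]}

def read_file(name: str) -> bytes:
    """Follow the table to gather a file's blocks in order."""
    return b"".join(bytes(disk[n]) for n in allocation_table[name])

print(len(read_file("report.txt")))  # 3 blocks * 4096 bytes = 12288
```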


NAS Storage – This technology presents files over the network using SMB (CIFS) and NFS. It still references blocks and uses a file system like WAFL, ZFS, etc., but the model functions at the user level rather than the host level. Think of it as one layer above the SAN storage example above: you still need a block storage device, but you access the data with a protocol like SMB or NFS rather than iSCSI or Fibre Channel. Compared to SAN, NAS storage is typically easier to manage and can hide some of the complexities of block-based storage from the user and administrator. Plus, you aren't constrained at the OS or server level with this particular solution. Now that NAS performance is comparable to iSCSI and FC, it's a great option for a lot of workloads.

Object Storage – This is the new paradigm. It's similar to NAS, but it uses objects rather than files and makes them available via HTTP using an approach called REST. REST is a lightweight way of accessing data over the web using standard HTTP operations (GET, PUT, DELETE, etc.). Object storage still uses block storage under the covers, much like NAS, but in a much simpler format. It also uses a different form of data protection to address some challenges in older RAID technology. In addition, it allows multiple software platforms to access data without dealing with the complexities of FC, iSCSI, NFS, or SMB, so from a software developer's perspective it makes accessing data a whole lot simpler.
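
For example, storing and retrieving an object comes down to a couple of HTTP calls. The endpoint, bucket, and key below are made-up placeholders, and real services also require authentication, which is omitted here:

```python
import requests

# Hypothetical object storage endpoint; real services also require
# authentication (e.g. signed request headers), omitted for brevity.
BASE = "https://objects.example.com/mybucket"

# PUT writes (or overwrites) an object under a key.
requests.put(f"{BASE}/reports/2015-09.pdf", data=open("2015-09.pdf", "rb"))

# GET retrieves the same object by key.
resp = requests.get(f"{BASE}/reports/2015-09.pdf")
print(resp.status_code, len(resp.content))

# DELETE removes it.
requests.delete(f"{BASE}/reports/2015-09.pdf")
```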

High-level View

Your typical object storage environment consists of a cluster of Linux servers or appliances (sometimes called nodes) behind a load balancer. When a request comes in from a client, the load balancer (depending on its algorithm) determines which node receives it. In this example there are six nodes: the request comes in and is sent to node 1, the client writes the file to node 1, and the cluster then makes copies of the file on node 3 and node 6. This provides redundancy and replaces the need for RAID in this solution; it's called a three-object-copy scheme. All copies of a file are available for both reads and writes, which is important. These nodes could be in the same data center or in geographically different locations. This is a fairly simplified design, but it gives you a high-level view of how object storage works.
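
A minimal sketch of that placement logic might look like the following. The hash-based node selection and the six-node layout are assumptions for illustration, not any particular product's algorithm:

```python
import hashlib

NODES = ["node1", "node2", "node3", "node4", "node5", "node6"]
COPIES = 3  # the three-object-copy scheme described above

def placement(key: str) -> list[str]:
    """Pick COPIES distinct nodes for an object, derived from its key."""
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(NODES)
    # Step around the node list so all replicas land on distinct nodes.
    return [NODES[(start + i * 2) % len(NODES)] for i in range(COPIES)]

# Three distinct nodes, chosen deterministically from the object's key.
print(placement("reports/2015-09.pdf"))
```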

What is REST?

We briefly touched on REST earlier in this post. Representational State Transfer (REST) is a newer approach that was originally designed to solve some of the challenges in the world of web development, more specifically around better ways of building web services and the drawbacks of protocols like the Simple Object Access Protocol (SOAP). What made REST appealing for object storage is its extremely lightweight and highly customizable nature. Amazon S3 is a customized REST API developed by Amazon Web Services (AWS), and S3 is quickly becoming the protocol of choice for object storage. There are other protocols out there, like the Cloud Data Management Interface (CDMI) developed by SNIA and the OpenStack Object Storage protocol (Swift), but it's clear that for now S3 is the leader.
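
Because so many products speak the S3 API, the same client code often works across vendors. Here's a minimal sketch using the boto3 library; the endpoint and bucket name are placeholders:

```python
import boto3

# Many S3-compatible object stores accept an endpoint_url override;
# credentials come from the environment or a config file.
s3 = boto3.client("s3", endpoint_url="https://objects.example.com")

# Upload (PUT) an object, then read it back (GET).
s3.put_object(Bucket="mybucket", Key="reports/2015-09.pdf",
              Body=open("2015-09.pdf", "rb"))

obj = s3.get_object(Bucket="mybucket", Key="reports/2015-09.pdf")
print(len(obj["Body"].read()))
```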

Protection Types:

Replica – Replica is the object-copy model and the most rudimentary protection method. For every file written to a node in the cluster, two more copies are written to other nodes in the cluster. This is fairly simple to implement, but it triples the amount of raw capacity needed, which isn't very efficient. The most common variant is the "three copies" protection scheme. It's still widely used by a lot of cloud providers but doesn't have much of a future in the enterprise space.
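
The capacity math is worth spelling out; a quick sketch with illustrative numbers:

```python
# With three-copy replication, usable capacity is raw capacity divided by
# the copy count: storing 100 TB of data consumes 300 TB of raw disk.
raw_tb = 300
copies = 3
usable_tb = raw_tb / copies
print(f"{raw_tb} TB raw -> {usable_tb:.0f} TB usable ({copies}x overhead)")
```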

Erasure Coding – Erasure coding is a technology developed by NASA for deep-space telemetry because of the high rate of signal loss in space. The algorithm was capable of reconstructing a signal even with 30% to 40% signal loss. It was later found that this algorithm works well as a data protection scheme in distributed cloud storage environments, where you're dealing with the challenges of WAN connectivity.

How does erasure coding work? In an object storage model, erasure coding splits a file into segments and adds a hash to each segment. These hashed segments form the protection mechanism (metadata), which doesn't double or triple the file size the way the replica model does. This saves disk space, making erasure coding a more cost-efficient and better-performing model in most deployments, and it can survive several device failures depending on how it's deployed. Because of this protection scheme, large-object deployments are typically its sweet spot.
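
Here's a greatly simplified sketch of the idea. Real systems use Reed-Solomon style codes rather than a single XOR parity, but the principle of rebuilding a lost segment from the survivors is the same:

```python
# Simplified erasure-coding sketch: 3 data segments + 1 XOR parity segment.
# Any ONE lost segment is recoverable at only 1.33x capacity overhead,
# versus 3x for three full replicas. Production systems use Reed-Solomon
# codes that tolerate several simultaneous losses.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data = b"objectstore!"                                 # 12 bytes of payload
segments = [data[i:i + 4] for i in range(0, 12, 4)]    # 3 data segments
parity = xor(xor(segments[0], segments[1]), segments[2])

# Simulate losing segment 1, then rebuild it from the survivors + parity.
rebuilt = xor(xor(segments[0], segments[2]), parity)
assert rebuilt == segments[1]
print(b"".join([segments[0], rebuilt, segments[2]]))   # b'objectstore!'
```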

The challenge with large deployments that span geographic locations is that they require a lot of back-end infrastructure, even with erasure coding. Depending on the rate of change, the amount of data that traverses the WAN can be quite substantial. So replication is still typically used where data needs to be sent to a remote location, with erasure coding taking place within a single site.

NoSQL DB – NoSQL is a non-relational database designed for large-scale deployments, where unstructured data is stored across multiple nodes or servers. This distributed architecture scales horizontally: as data grows, there's no decrease in performance. Relational databases are the opposite; they typically require more compute horsepower and only scale vertically, meaning you need a larger box once you've consumed all of its resources. NoSQL originally took off with the growth of Web 2.0 applications, but we're now seeing it used in big data applications and cloud storage. NoSQL's ability to scale makes it a good choice for object stores where small objects are used.
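
To illustrate why this fits object storage, here's a sketch of spreading self-contained metadata records across nodes by hashing the object key. The record schema is invented for illustration:

```python
import hashlib
import json

SHARDS = 4  # metadata servers in the cluster

def shard_for(key: str) -> int:
    """Hash the object key to pick a shard; adding shards spreads the load."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % SHARDS

# A self-contained metadata document: no joins, no fixed schema, so each
# record can live on whichever node its key hashes to.
record = {
    "key": "reports/2015-09.pdf",
    "size": 1048576,
    "replicas": ["node1", "node3", "node6"],
    "content_type": "application/pdf",
}
print(shard_for(record["key"]), json.dumps(record))
```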

Use Cases:

Object storage still hasn't been widely adopted by the enterprise. Part of the challenge is its limited use cases and the small number of products available on the market. As mentioned earlier, object storage was designed to address a need in software development, accessing data without the complexities of older technologies, along with the massive growth in unstructured data. We should start to see more adoption of these types of technologies in the next 5 to 10 years: analysts are predicting unstructured data will grow as much as 20 to 40 times, and object storage is the only practical way to address growth at that scale.

Backup and archive is another area where we could see object storage take off in the next few years. It's a great way to back up data to disk and start phasing out that tape environment, and we're already seeing a lot of backup vendors developing gateways. You need what is called a gateway server or appliance to convert NAS/SAN-based protocols to REST or S3 (a toy sketch follows below). At some point you'll see media servers support this function natively as object storage grows in popularity. Backup vendors faced the same challenge when writing to disk 10 years ago: a VTL had to be used to present disk-based storage because backup software could only write to tape.
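
In spirit, a gateway just bridges the file side to the object side. A toy sketch of that loop, assuming a hypothetical NFS/SMB landing directory and S3 endpoint; a real gateway also handles locking, caching, and partial restores:

```python
import boto3
from pathlib import Path

# Hypothetical "gateway" sweep: backup software writes files to a directory
# over NFS/SMB, and we push each file into an object store via the S3 API.
s3 = boto3.client("s3", endpoint_url="https://objects.example.com")
landing = Path("/mnt/backup-landing")  # the NAS-facing side

for path in landing.rglob("*"):
    if path.is_file():
        key = str(path.relative_to(landing))  # object key mirrors the path
        s3.upload_file(str(path), "backup-bucket", key)
        print(f"archived {key}")
```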

Conclusion

As mentioned earlier, object-based storage is a great solution for unstructured data, backup, and archive. However, it's not a good option for virtualization, databases, and other high-I/O workloads; Flash-optimized block- and file-based storage is a better solution for those. That said, I do see the gap between object-based storage and the more traditional methodologies, e.g. SAN and NAS, shrinking over the next 10 years.