    Showing posts with label database. Show all posts

    Friday, May 6, 2011

    Disruptive Cloud Start-Ups - Part 1: NimbusDB

    Attending Under The Radar (UTR) to watch disruptive companies present, and to network with entrepreneurs, thought leaders, and venture capitalists, is an annual tradition I don't miss. I have blogged about disruptive start-ups I saw in previous years. The biggest exit out of UTR that I have witnessed so far is Salesforce.com's $212 million acquisition of Heroku. This post is about one of the disruptive start-ups that I saw at UTR this year - NimbusDB.

    I met with Barry Morris, the CEO and founder of NimbusDB, at a reception the night before. I had a long conversation with him about the issues with legacy databases, NoSQL, and of course NimbusDB. I must say that, after a long time, I have seen a company applying all the right design principles to solve a chronic problem - how do you make SQL databases scale so that they don't suck?

    One of the main issues with legacy relational databases is that they were never designed to scale out in the first place. A range of NoSQL solutions addressed the scale-out issue, but the biggest problem with NoSQL is that NoSQL is not SQL. This is why I was excited when I saw what NimbusDB has to offer: it's a SQL database at the surface but has a radically modern architecture underneath that leverages MapReduce to divide and conquer queries, BitTorrent for messaging, and Dynamo for persistence.

    NimbusDB's architecture isolates transactions from storage and uses asynchronous messaging across nodes - a non-blocking atom commit protocol - to gain horizontal scalability. At the application layer, it supports "most" of SQL-99 and doesn't require developers to re-learn or re-code. The architecture doesn't involve any kind of sharding, and the nodes can run on any commodity machine on a variety of operating systems. This eliminates the need for a separate hot backup, since any and all nodes serve as a live database in any zone. That makes NimbusDB an always-live system, which also solves a major problem with traditional relational databases - high availability. It's an insert-only database that versions every single atom/record, which is also how it achieves MVCC. The data is compressed on disk and accessed from in-memory nodes.
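
A minimal sketch of the insert-only versioning idea described above (this is my own toy illustration of MVCC over an append-only store, not NimbusDB's actual design or code; the class and method names are invented for the example):

```python
import itertools

class VersionedStore:
    """Toy insert-only store: an "update" appends a new version and never
    overwrites. A reader holding a snapshot transaction id keeps seeing
    the versions that were current when it started - the essence of MVCC."""

    def __init__(self):
        self._versions = {}                    # key -> list of (txn_id, value)
        self._txn_counter = itertools.count(1)

    def write(self, key, value):
        txn = next(self._txn_counter)          # every write is a new version
        self._versions.setdefault(key, []).append((txn, value))
        return txn

    def read(self, key, as_of=None):
        """Latest version visible at transaction `as_of` (None = newest)."""
        for txn, value in reversed(self._versions.get(key, [])):
            if as_of is None or txn <= as_of:
                return value
        return None

store = VersionedStore()
t1 = store.write("balance", 100)
store.write("balance", 80)         # an "update" is just another insert
store.read("balance")              # newest value: 80
store.read("balance", as_of=t1)    # snapshot read still sees 100
```

Because old versions are never destroyed, readers never block writers, which is the property that makes this style of store "always live" for reads.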

    I asked Barry about using NimbusDB as an analytic database, and he said that it is currently not optimized for analytic queries, but he does not see why it couldn't be tuned and configured as one, since the underlying architecture doesn't really have to change. During his pitch, though, he did mention that NimbusDB may have challenges with heavy reads and heavy writes. I personally believe that solving the problem of analytic queries on large volumes of data is a much bigger challenge in the cloud, due to the cloud's inherently distributed nature. Similarly, building a heavy-insert system is equally difficult. However, most systems fit somewhere in between, and that could be a great target market for NimbusDB.

    I haven't played around with the database, but I do intend to. On a cursory look, it seems to defy the CAP theorem; Barry seems to disagree with me. The founders of NimbusDB have great backgrounds: Barry was the CEO of IONA and Streambase and has extensive experience in building and leading technology companies. If NimbusDB can execute on the principles it is designed around, this will be a huge breakthrough.

    As a general trend, I see a clear transition, where people finally agree that SQL is still a preferred interface, but the key is to rethink the underlying architecture.

    Update: After I published the post, Benjamin Block raised concerns about NimbusDB not getting the CAP theorem. As I mentioned in the post, I had the same concern, but I would give them the benefit of the doubt for now and watch the feedback as the product goes into beta.

    Check out their slides and the presentation:



    Tuesday, April 27, 2010

    Delphix Is A Disruptive Database Virtualization Start-up To Watch

    This is my second post on my impressions from the Under The Radar conference. Check out the first post on NoSQL.

    Virtualization is not cloud computing. However, virtualization has significant potential when it is used to achieve cloud-like characteristics such as elasticity, economies of scale, accessibility, and simplicity of deployment. I have always believed that the next wave of cloud computing is going to be all about solving "special" problems on the cloud - I call it a vertical cloud. These vertical problems could be in any domain, technology stack, or industry. Raw computing has come a long way. It is about time we did something more interesting with it.

    Delphix is attempting to solve a specific problem - database virtualization. I met the CEO Kaycee Lai and the VP of sales Jedidiah Yueh at the Under The Radar reception the night before. They have a great background in understanding the cost and flexibility issues around de-duplication from their days at EMC. They have assembled a great team, including Alok Srivastava from Oracle, who ran Oracle RAC engineering prior to joining Delphix. Most large database deployments keep multiple copies of a single database for purposes beyond production, such as staging, testing, and troubleshooting. This replication is expensive in terms of process, resources, and storage, and provisioning instances takes a long time. The founders saw this problem first hand at EMC and decided to solve it.

    At the core, their offering is a read-write snapshot of a database. That's quite an achievement: traditional snapshots are, well, snapshots - you can't modify them, and that compromise is what lets them occupy far less space. Delphix took the same concept but created writable snapshots, along with a seemingly easy-to-use application (I haven't used it) that allows quick de-duplication based on these snapshots. You can also go back in time and start your instance from there.
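
A toy sketch of how a writable snapshot can work via copy-on-write (my own illustration of the general technique, not Delphix's implementation; all names here are invented):

```python
class WritableSnapshot:
    """Copy-on-write view over a shared base dataset. Reads fall through
    to the base; writes land in a private overlay, so many snapshots can
    share a single copy of the unmodified data."""

    def __init__(self, base):
        self._base = base      # shared, never modified through the snapshot
        self._overlay = {}     # private modified entries
        self._deleted = set()  # entries deleted only in this snapshot

    def read(self, key):
        if key in self._deleted:
            return None
        if key in self._overlay:
            return self._overlay[key]
        return self._base.get(key)

    def write(self, key, value):
        self._deleted.discard(key)
        self._overlay[key] = value   # copy-on-write: base stays untouched

    def delete(self, key):
        self._overlay.pop(key, None)
        self._deleted.add(key)

production = {"cust:1": "Alice", "cust:2": "Bob"}   # the "real" database
test_env = WritableSnapshot(production)             # instant, near-zero space
test_env.write("cust:2", "Bobby")                   # private change
# production["cust:2"] is still "Bob"; only the snapshot sees "Bobby"
```

The space savings come from the overlay: a staging or test copy costs only as much as the data it actually changes, which is why provisioning such an instance can be near-instant.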

    Delphix has a great value proposition in database virtualization: help customers reduce their hardware and people costs - DBAs and system administrators - while accelerating IT processes. I like their conscious decision not to go after the backup market. Sometimes you have a great product, but if it is marketed in the wrong category, with vendors fighting in a red ocean, you could die before you can grow. They had the best pitch at the conference - very calm, explaining the problem, articulating the value proposition, emphasizing the right people on the team, and identifying the target market. If you are an entrepreneur (or even if you are not), check out their pitch and Q&A. There is a lot you can learn from them.

    Friday, March 5, 2010

    NoSQL Is Not SQL And That’s A Problem

    I do recognize the thrust behind the NoSQL movement. While some are announcing the end of an era for MySQL and memcached, others are questioning the arguments behind Cassandra's OLTP claims and scalability, and the universal applicability of NoSQL. It is great to see innovative data persistence and access solutions that challenge the long-standing legacy of the RDBMS. Competition between HBase and Cassandra is heating up, and Amazon now supports a variety of consistency models on EC2.

    However, none of the NoSQL solutions solve a fundamental underlying problem: a developer has to pick persistence, consistency, and access options for an application upfront.

    I would argue that the RDBMS has been popular for the last 30 years because of ubiquitous SQL. Whenever developers wanted to design an application, they put an RDBMS underneath and used SQL from all possible layers. Over time the RDBMS grew in functions and features - binary storage, faster access, clustering, etc. - and applications reaped these benefits.
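
The ubiquity the paragraph describes is easy to demonstrate: the same declarative SQL works from any layer of an application, against any conforming engine. A minimal sketch using Python's built-in sqlite3 module (the table and values are invented for the example):

```python
import sqlite3

# The pattern the post describes: drop an RDBMS underneath and speak SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders (total) VALUES (?)", [(19.5,), (42.0,)])
conn.commit()

# The same declarative interface serves every layer of the application;
# the engine, not the developer, decides how to execute the query.
(grand_total,) = conn.execute("SELECT SUM(total) FROM orders").fetchone()
```

The developer states *what* to compute; the optimizer picks *how*, which is exactly the separation the next paragraph credits to cost-based optimizers.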

    I still remember the days when you had to use a rule-based optimizer to teach the database how best to execute a query. These days cost-based optimizers can find the best plan for a SQL statement, taking the guesswork out of the equation. This evolution teaches us an important lesson: application developers, and to some extent even database developers, should not have to learn the underlying data access and optimization techniques. They should expect an abstraction that lets them consume data, with consistency and persistence optimized based on the application's needs and the content being persisted.

    SQL did a great job as a non-procedural language (what to do) against many past and current procedural languages (how to do). SQL did not, however, solve the problem of staying independent of the schema: developers still had to learn how to model the data. When I first saw schema-less data stores, I thought we would finally solve the age-old problem of making an upfront decision about how data is organized. We did solve that problem, but we introduced a new one - a lack of ubiquitous access and consistency options for schema-less data stores. Each of these data stores comes with its own set of access APIs that are not necessarily complicated, but are uniquely tailored to address parts of the mighty CAP theorem. Some solutions went even further and optimized for specific consistency levels such as eventual consistency, weak consistency, etc.

    I am always in favor of giving more options to developers; it's usually a good thing. However, what worries me about NoSQL is that it is not SQL. There simply isn't enough push for ubiquitous and universal design-time abstractions. The runtime is certainly getting better, cheaper, and faster, but it is being pushed directly at developers, skipping a whole lot of layers in between. Google designed BigTable and MapReduce. Facebook took the best of BigTable and Dynamo to design Cassandra, and Yahoo! wanted scripting rather than programming on Hadoop and hence designed Pig. These vendors spent significant time and resources for one reason - to make their own applications run faster and better. What about the rest of the world? Not all applications share the same characteristics as Facebook and Twitter, and enterprise software is certainly quite different.

    I would like to throw out a challenge: design a data store that has a ubiquitous interface for application developers and is independent of consistency models, upfront data modeling (schema), and access algorithms. As a developer, you start storing, accessing, and manipulating information, treating everything underneath as a service. As a data store provider, you gather upstream application and content metadata to configure, optimize, and localize your data store, providing a ubiquitous experience to developers. As an ecosystem partner, you plug your hot-swappable modules into data stores designed to meet the specific data access and optimization needs of the applications.
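
One way the "hot-swappable modules behind a ubiquitous interface" idea could be sketched - purely illustrative, with every class and method name invented here - is an application-facing facade whose storage strategy is chosen by configuration rather than baked into application code:

```python
from abc import ABC, abstractmethod

class StorageModule(ABC):
    """Hot-swappable backend: consistency and access strategy live here,
    not in the application."""
    @abstractmethod
    def get(self, key): ...
    @abstractmethod
    def put(self, key, value): ...

class InMemoryModule(StorageModule):
    """Simplest possible module; a Dynamo-style or BigTable-style module
    would implement the same two methods."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def put(self, key, value):
        self._data[key] = value

class DataStore:
    """The ubiquitous developer-facing interface: save and load, with
    the backend swappable without touching application code."""
    def __init__(self, module: StorageModule):
        self._module = module
    def save(self, key, value):
        self._module.put(key, value)
    def load(self, key):
        return self._module.get(key)

store = DataStore(InMemoryModule())    # swap the module, keep the app code
store.save("user:7", {"name": "Ada"})
store.load("user:7")
```

The provider's job, in this framing, is to pick and tune the module from observed workload metadata, while the developer only ever sees `save` and `load`.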

    Are you up for the challenge?

    Thursday, August 27, 2009

    SOAP may finally REST

    Lately I have observed significant movement in two transformational trends: the adoption of REST over SOAP, and the proliferation of non-relational persistence options. These two trends complement each other, and they are likely to cause disruption sooner rather than later.

    Enterprise software that required complex transactions, monitoring, and orchestration capabilities relied on SOAP-based architecture and standards to realize its SOA efforts. The consumer web, on the other side, raced to embrace RESTful interfaces, since they were simple to set up and consume. There are arguments on both sides. Lately, however, market forces have taken the side of REST, even though REST has significant drawbacks in areas such as security and transactions. This once again proves that a simple, good-enough approach that conforms to loose contracts outweighs a complex solution that complies with stricter standards, even if it means compromising certain critical features. The web is essentially an unreliable, stateless medium, and any attempt to regulate it is unlikely to work in our favor.

    Many argue that SOAP's self-describing standards are its strength over RESTful services, which lack such features. However, designing a RESTful service is fairly trivial, since it lets you learn and experiment iteratively, unlike the relatively complex upfront learning process associated with SOAP-based architecture. There has been a flurry of activity in messaging middleware from Google that makes RESTful interfaces even more compelling, including Google Wave Federation and PubSubHubbub. Developers are more likely to prefer these messaging protocols over SOAP, and that would mean more RESTful APIs in the Pushbutton Web. Easy consumability reduces the initial adoption barrier, and that's the key to success in many cases.
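
To make the "fairly trivial" claim concrete, here is a complete RESTful endpoint in Python's standard library alone - no WSDL, no code generation, just a URI naming a resource and an HTTP verb acting on it (the `/notes` resource and its data are invented for the example):

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

NOTES = {"1": "buy milk"}   # a toy resource collection

class NotesHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The URI names the resource; the verb (GET) says what to do.
        note_id = self.path.rstrip("/").split("/")[-1]
        if self.path.startswith("/notes/") and note_id in NOTES:
            body = json.dumps({"id": note_id, "text": NOTES[note_id]}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):   # keep the demo quiet
        pass

# Serve on an ephemeral port and issue one request against it.
server = HTTPServer(("127.0.0.1", 0), NotesHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/notes/1"
with urllib.request.urlopen(url) as resp:
    payload = json.loads(resp.read())
server.shutdown()
```

Any HTTP client in any language can consume this immediately, which is precisely the low adoption barrier the paragraph is about.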

    Since I last blogged about the continuum of databases on the cloud, from schemaless to full-schema, new persistence options such as RethinkDB and HadoopDB have emerged, and many debates have sprung up questioning the legacy of the RDBMS. For a cloud-like environment, statelessness, ad hoc persistence design, and instantaneous horizontal scale go well with RESTful architecture. The growing popularity of SimpleDB and CouchDB, along with many discussions of how to achieve CRUD with REST, signals that persistence is becoming more RESTful and schemaless.

    I was convinced quite some time back that REST was certainly the future for the consumer web, but the latest trends have made me believe that REST's adoption in enterprise software will accelerate much sooner than I had originally expected. This is like Java and the Internet: organizations embraced Java and the Internet at the same time. The same will be true for the cloud and REST. When companies consider moving to the cloud, they will reconsider their SOA and persistence strategies and will likely adopt REST and alternate persistence models.

    The cloud might be the last nail in the SOAP coffin.

    Friday, April 24, 2009

    Database Continuum On The Cloud - From Schemaless To Full-Schema

    A recent paper by Mike Stonebraker and others compared relational and columnar databases in a parallel configuration with MapReduce. The paper concludes that MapReduce is an easy-to-configure and easy-to-use option, whereas the other data stores - relational and columnar databases - pay the upfront price of organizing the data but outperform MapReduce in runtime performance. The study does highlight that the chosen option does not necessarily dictate or limit scale, as long as other attributes - an effective parallelism algorithm, B-tree indices, main-memory computation, compression, etc. - can help achieve the desired scale.

    The real issue, which is not being addressed, is that even if the chosen approach does not limit scale, it still significantly impacts the design-time decisions that developers and architects have to make. These upfront decisions limit the functionality of the applications built on these data stores and reduce the overall design agility of the system. Let's look at a brief history of the evolution of the DBMS, a data-mining renaissance, and what we really need to design a data store that makes sense from the consumption, and not the production, viewpoint.

    Brief history of evolution of DBMS

    Traditionally, relational database systems were designed to meet the needs of transactional applications such as ERP, SCM, and CRM, also known as OLTP. These database systems provided row stores, indexes that work well for selective queries, and high transactional throughput.

    Then came the BI age, which required accessing all the rows but fewer columns, and needed to apply mathematical functions such as aggregation and averaging to the data being queried. A relational DBMS did not seem to be the right choice, but vendors figured out creative ways to use the same relational DBMS for BI systems.

    As the popularity of BI systems and the volume of data grew, two kinds of solutions emerged: one that still used a relational DBMS but accelerated performance via innovative schemas and specialized hardware, and the other, the columnar database, which used a column store instead of a row store. A columnar DBMS stores data grouped by column, so that a typical BI query can read all the rows but only the few columns it needs in a single read operation. Columnar vendors also added compression and main-memory computation to accelerate runtime performance. The overall runtime performance of BI systems certainly got better.
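
The row-store versus column-store trade-off is easy to show in miniature (the table and values here are invented for illustration):

```python
# Row store: each record's fields sit together - good for OLTP point reads
# ("give me everything about order 2").
row_store = [
    {"id": 1, "region": "EU", "revenue": 120.0},
    {"id": 2, "region": "US", "revenue": 300.0},
    {"id": 3, "region": "EU", "revenue": 80.0},
]

# Column store: each column sits together - a BI aggregate touches only
# the columns it needs instead of every row in full.
col_store = {
    "id":      [1, 2, 3],
    "region":  ["EU", "US", "EU"],
    "revenue": [120.0, 300.0, 80.0],
}

# Typical BI query: total revenue per region, reading just two columns.
totals = {}
for region, revenue in zip(col_store["region"], col_store["revenue"]):
    totals[region] = totals.get(region, 0.0) + revenue
```

On disk, the columnar layout also compresses far better, since values within one column are similar to each other - which is why the vendors mentioned above paired column stores with compression.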

    Both approaches, row-based and columnar, still required ETL - a process to extract data out of the transactional systems, apply transformation functions, and load the data into a separate BI store. They did not solve the issue of "design latency": the upfront time consumed in designing a BI report due to the required transformation and a series of complicated modeling steps.

    Companies such as Greenplum and Aster Data decided to solve some of these legacy issues. Greenplum provides design-time agility by adopting a dump-all-your-data approach and applying transformations on the fly, only when needed. Aster Data has three layers to address the query, load, and execute aspects of the data. These are certainly better approaches that use parallelism well and have cloud-like behavior, but they are still designed to patch up legacy issues and do not provide a clean design-time data abstraction.

    What do we really need?

    MapReduce is powerful because it is extremely simple to use: the developer supplies just two functions, map and reduce, and the framework handles splitting and grouping the data. Such schemaless approaches have lately grown in popularity because developers don't want to lock themselves into a specific data model; they also want to explore ad hoc computing before optimizing for performance. There are even extreme scenarios, such as FriendFeed using the relational database MySQL to store schema-less data. MapReduce has a very low barrier to entry. On the other hand, the fully defined schema approach of relational and columnar DBMSs offers great runtime performance once the data is loaded and indexed, both for transactional access and for executing BI functions such as aggregation, average, and mean.
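
The whole model fits in a few lines. A single-process sketch of the MapReduce contract (real frameworks distribute and parallelize these same steps; this toy version only shows the programming model):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Tiny MapReduce: map emits (key, value) pairs, the framework
    groups values by key, and reduce folds each group to a result."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# The canonical example: word count.
docs = ["the quick fox", "the lazy dog", "the fox"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(word, 1) for word in doc.split()],
    reducer=lambda word, ones: sum(ones),
)
```

Note that nothing here requires a schema: the mapper can emit whatever keys and values it likes, which is exactly the low barrier to entry the paragraph describes.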

    What we really need is a continuum from a schemaless to a full-schema database based on the context, action, and access patterns of the data. A declarative, abstracted persistence layer for accessing and manipulating the database, optimized locally for various actions and access patterns, is the right approach. This would allow developers to fetch and manipulate data independent of the storage and access mechanism. For example, developers could design an application where a single page performs complex structured and unstructured search, creates a traditional transaction, and displays rich analytics, all from a single logical data store, without worrying about which algorithms fetch and store the data or how the system is designed to scale. This might require a hybrid data store architecture that optimizes the physical storage of data for certain access patterns and uses redundant storage replicated in real time, along with other mechanisms such as accelerators, for other patterns, to provide unified data access to the applications upstream.

    Schemaless databases such as SimpleDB, CouchDB, and Dovetail are in their infancy, but the cloud is a good platform for the key requirements of schemaless databases: incremental provisioning and progressive structure. The cloud is also a great platform for full-schema DBMSs, offering utility-style incremental computing to accelerate runtime performance. A continuum on the cloud may not be that far-fetched after all.

    Tuesday, March 4, 2008

    Open source licenses and their impact on commercialization

    The choice of an open source license sparks a debate from time to time, and this time around it is about using the GPL as a strategic weapon to force your competitors to share their code, versus using BSD to show faith in your proprietary solution as an open source derivative and reduce the barrier to entry into the market. I agree with the success of MySQL, but I won't attribute its entire success to the chosen license. Comparing open source licenses only in the context of commercializing a database is a very narrow comparison. First of all, PostgreSQL and MySQL are not identical databases and don't have exactly the same customers; secondly, I see a database as an enabler for value added on top of it. EnterpriseDB is a great example of this value add, and I think it is very speculative to say whether it is an acquisition target or not. The real question is whether EnterpriseDB would have accomplished the same if PostgreSQL had used the GPL instead of BSD.

    I see plenty of opportunity in open source license innovation, and over time disruptive business models will force the licenses to align with what businesses really need. The IP indemnification in GPLv3 is a classic example of how licenses evolve based on the commercial dynamics among organizations. We can expect licenses to become even more complex with the wide adoption of SaaS delivery models, where a vendor is not shipping any software anymore.

    People do believe in open source, but may not necessarily believe that they have a legal obligation to contribute back to the open source community every time they do something interesting with it - Richard Stallman would strongly disagree. Companies such as Black Duck have built a successful business model on the very fact that vendors don't want to ship GPLed code. We should not fight the license; just be creative, embrace open source, and innovate!

    Monday, February 11, 2008

    Data encryption as a new class of DoS

    Not too sure what to make of this argument. Experts from IBM Internet Security Systems, Juniper, and nCipher argue that data encryption is a new class of DoS. The post says, "It's a new class of DoS attack. If you can go in and revoke a key and then demand a ransom, it's a fantastic way of attacking a business." This does not make any sense: if someone can get your private key revoked, you have a lot more to worry about than data encryption.

    It also says, "Another risk is that over-zealous use of encryption will damage an organization's ability to legitimately share and use critical business data." The storage is encrypted but the access is not, so I am not sure what sharing issues the post is talking about. Leading database vendors such as Oracle provide column-level encryption, where data is encrypted before it is stored but decrypted on the fly when accessed, transparently to the user and the application. That said, only a limited set of real-time data should be encrypted, since there is a decryption overhead every time the data is accessed, and the physical and digital security of a real-time data store is much better than that of offline storage such as backup tapes. Backups, on the other hand, should always be completely encrypted, because they are not supposed to be accessed in real time and there is a greater risk of a tape falling off a UPS truck or being stolen by baggage handlers. In fact, Oracle once considered not allowing unencrypted backups at all.
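
The encrypt-before-store, decrypt-on-read pattern behind transparent column encryption can be sketched as follows. This is a toy illustration only: the "cipher" here is a SHA-256-based keystream standing in for a real cipher such as AES, and must not be used to protect actual data; all class and column names are invented.

```python
import hashlib

def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream from SHA-256 - a stand-in for a real cipher like
    AES. Do NOT use this construction for real data protection."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

class EncryptedColumn:
    """Encrypt before store, decrypt on read: the application above this
    layer never sees ciphertext, mirroring transparent column encryption."""

    def __init__(self, key: bytes):
        self._key = key
        self._rows = {}                          # row_id -> ciphertext

    def store(self, row_id: int, plaintext: str):
        nonce = row_id.to_bytes(8, "big")        # toy per-row nonce
        data = plaintext.encode()
        ks = _keystream(self._key, nonce, len(data))
        self._rows[row_id] = bytes(a ^ b for a, b in zip(data, ks))

    def fetch(self, row_id: int) -> str:
        nonce = row_id.to_bytes(8, "big")
        cipher = self._rows[row_id]
        ks = _keystream(self._key, nonce, len(cipher))
        return bytes(a ^ b for a, b in zip(cipher, ks)).decode()

ssn_column = EncryptedColumn(key=b"demo-key")
ssn_column.store(1, "123-45-6789")
ssn_column.fetch(1)    # caller gets plaintext back, transparently
```

The point of the pattern is that only what sits at rest is ciphertext; every read pays the decryption cost, which is exactly the overhead argument for encrypting only sensitive columns rather than everything.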

    What really matters is the organization's encryption strategy for data accessed in real time and for data backed up on tape. Some simple key management solutions, along with the right decisions and governance, can solve the supposed DoS problems being discussed. You could take any tool, use it the wrong way, and then complain about the tool itself. Encryption is just a tool and an enabler; you have to figure out how to use it. If you look closely at the "experts" in the post, they are in the key management business and want you to believe that your keys will be revoked one day, that you might end up paying ransom and risking your data - so why not pay us now and buy our software?