Cassandra 0.7 Can Pack 2 Billion Columns Into a Row



Posted by timothy on Sunday January 16, 2011 @08:58PM from the but-only-if-they're-really-thin dept.

angry tapir writes "The cadre of volunteer developers behind the Cassandra distributed database have released the latest version of their open source software, able to hold up to 2 billion columns per row. The newly installed Large Row Support feature of Cassandra version 0.7 allows the database to hold up to 2 billion columns per row. Previous versions had no set upper limit, though the maximum amount of material that could be held in a single row was approximately 2GB. This upper limit has been eliminated."

This discussion has been archived. No new comments can be posted.





Typical applications? (Score:3, Interesting)

by oldhack ( 1037484 ) writes: on Sunday January 16, 2011 @09:01PM (#34900840)

What sorta applications need so many columns? Curious.

- Re:Typical applications? (Score:5, Funny)
  
  by Brummund ( 447393 ) writes: on Sunday January 16, 2011 @09:12PM (#34900896)
  
  Any application developed by one or more Visual Basic developers, given enough time.
  
  - Re: (Score:2)
    
    by jrumney ( 197329 ) writes:
    
    Any application developed by one or more Visual Basic developers, given enough time.
    How could that possibly be true, MS Access only supports 255 columns.
    - Re:Typical applications? (Score:4, Funny)
      
      by RobertM1968 ( 951074 ) writes: on Sunday January 16, 2011 @11:23PM (#34901548) Homepage Journal
      
      Any application developed by one or more Visual Basic developers, given enough time.
      How could that possibly be true, MS Access only supports 255 columns.
      And now you understand why Cassandra is so important! :-)
      
      - Re:Typical applications? (Score:4, Informative)
        
        by NFN_NLN ( 633283 ) writes: on Monday January 17, 2011 @01:50AM (#34902146)
        
        Any application developed by one or more Visual Basic developers, given enough time.
        How could that possibly be true, MS Access only supports 255 columns.
        And now you understand why Cassandra is so important! :-)
        In all seriousness I had no idea what Cassandra was or what made it unique as a database. However, I did find this tutorial that others might also find useful:
        http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model [arin.me]
        
        
        Re: (Score:3)
        
        by Hognoxious ( 631665 ) writes:
        
        Doesn't a name/value pair act like a field? A record would be several of those that are related to each other. Or maybe not, who knows?
        But in any case, borrowing a term that's already in established usage to mean something superficially similar but significantly different is just fucking retarded. Whoever made that decision ought to be dragged behind a moderately slow horse.
  - Re: (Score:3)
    
    by bieber ( 998013 ) writes:
    
    In all seriousness, I'm horrified to see the potential abuses people will come up with for this.
    
    "Still using MySQL? Man, you need to check out Cassandra! MySQL kept clashing with my every-user-gets-their-own-column architecture..."
    - Re: (Score:2)
      
      by RobertM1968 ( 951074 ) writes:
      
      In all seriousness, I'm horrified to see the potential abuses people will come up with for this.
      "Still using MySQL? Man, you need to check out Cassandra! MySQL kept clashing with my every-user-gets-their-own-column architecture..."
      Wow, that is sloppy. I give each of my users their own table.
      - Re: (Score:2)
        
        by Lehk228 ( 705449 ) writes:
        
        each user gets a system login with a sqlite db in their home directory that holds their account information, posts are appended to "static" HTML files representing each thread, user db's include hyperlinks to each post to view posts by a particular user
      - Re:Typical applications? (Score:4, Insightful)
        
        by AlXtreme ( 223728 ) writes: on Monday January 17, 2011 @05:59AM (#34902936) Homepage Journal
        
        Dear $DEITY, the number of times I've seen (mostly) PHP crapplications use CREATE DATABASE and CREATE / ALTER TABLE, often with ingenious naming schemes, instead of simply inserting new rows. Certain people shouldn't be allowed to touch databases.
        If anyone needs me I'll be sobbing over my coffee.
        
    - Re:Typical applications? (Score:4, Interesting)
      
      by jellomizer ( 103300 ) writes: on Monday January 17, 2011 @08:08AM (#34903298)
      
      I don't think it is a good idea to impose limitations just to stop bad coding practices.
      First, limitations rarely encourage good practices; they only make bad ones worse. E.g., 254 columns with the 255th pointing to tablename2 with more data.
      Second, by preventing people from doing something stupid, they also prevent them from doing something ingenious.
      Third, there may be a good reason to do this as well.
      Fourth, you make it big enough so you won't need to make it bigger.
      
  - Re: (Score:2)
    
    by goombah99 ( 560566 ) writes:
    
    Who the hell cares? I mean, whoop-de-doo. So someone has a larger address space. Like, wow. For all 12 people with such a bad design that they need 12 billion columns, I'm sure they already figured out how to have keyed indexes. Why is this on Slashdot?
    - Re:Typical applications? (Score:4, Informative)
      
      by mini me ( 132455 ) writes: on Monday January 17, 2011 @01:39AM (#34902104)
      
      Cassandra did not support said indexes until this very release. Even with secondary indexes, storing data in columns is still a reasonable design choice for many requirements. A column in Cassandra is not like a column in a relational database.
      I am sure that this is welcome news for big Cassandra users, but I do agree that it is a strange choice for the front page of Slashdot. Then again, with the number of comments asking why you would need so many columns, it seems that Slashdot needs to talk about Cassandra a little more.
      
      - Re:Typical applications? (Score:5, Informative)
        
        by bjourne ( 1034822 ) writes: on Monday January 17, 2011 @08:28AM (#34903356) Homepage Journal
        
        Maybe Cassandra should have chosen some other terminology for their database, one that doesn't so obviously conflict with already existing terms. A column in Cassandra is a tuple, which in an RDBMS is a row. Confusion all around.
        
- Re: (Score:2)
  
  by Musically_ut ( 1054312 ) writes:
  
  What sorta applications need so many columns? Curious.
  From the article:
  An open source database capable of holding such lengthy rows could be most useful to big data cloud computing projects and large-scale Web applications, the developers behind the Apache Software Foundation project assert.
  So, basically, they don't know either but think (probably rightly so) that this is a pretty cool feature. So cool that they made this the heading of their article.
- Re:Typical applications? (Score:5, Interesting)
  
  by gratuitous_arp ( 1650741 ) writes: on Sunday January 16, 2011 @09:20PM (#34900970)
  
  Apparently the extra columns can be used to the effect of doing "more" than store data. A link in the article explains how lots of extra columns can be useful for querying data (Cassandra doesn't use SQL). http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/ [maxgrinev.com]
  So the primary reason for this doesn't seem to be that one's run-of-the-mill database needs more columns.
  
  - Re: (Score:3, Funny)
    
    by RobertM1968 ( 951074 ) writes:
    
    Apparently the extra columns can be used to the effect of doing "more" than store data. A link in the article...
    Not sure what that last word means....
- Re:Typical applications? (Score:4, Interesting)
  
  by SQL Error ( 16383 ) writes: on Sunday January 16, 2011 @09:31PM (#34901026)
  
  The main reason was that Cassandra prior to 0.7 didn't support secondary indexes. Your keys in a table ("columnfamily" in Cassandra-speak) were indexed, and the names of the columns in a row were indexed. And Cassandra is schemaless, so the columns in one row could be completely different to the columns in another.
  So you'd use columns as sub-records to get the data structures you need.
  With 0.7 and secondary indexes, that's going to be less important.
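
As a rough illustration of the "columns as sub-records" idea described above, here is a minimal Python sketch. It models a column family as plain dictionaries; the names `users`, `insert`, and `get_slice` are hypothetical and are not the Cassandra API. It assumes only what the comment states: rows are looked up by key, column names within a row are ordered, and different rows may hold entirely different columns.

```python
from collections import defaultdict

# Hypothetical in-memory model, NOT the Cassandra API:
# column family -> row key -> {column name: value}
users = defaultdict(dict)

def insert(cf, row_key, column, value):
    """Set one column in one row; the row is created on first write."""
    cf[row_key][column] = value

def get_slice(cf, row_key, start, end):
    """Return columns whose names fall in [start, end], sorted by name."""
    row = cf.get(row_key, {})
    return [(name, row[name]) for name in sorted(row) if start <= name <= end]

# Schemaless: two rows in the same column family with unrelated columns.
insert(users, "jsmith", "email", "jsmith@example.com")
insert(users, "jsmith", "country", "USA")
insert(users, "jdoe", "phone:home", "555-0100")
insert(users, "jdoe", "phone:work", "555-0199")

# Columns as sub-records: a shared name prefix groups related entries,
# and a name-range slice retrieves them without a secondary index.
phones = get_slice(users, "jdoe", "phone:", "phone:~")
print(phones)
```

The prefix trick works here only because the slice compares column names as strings; a real store would use whatever comparator the column family is configured with.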
  
  - Re: (Score:2)
    
    by AuMatar ( 183847 ) writes:
    
    Wow. I'm trying to think of any idea that could possibly beg for more bugs and bad design decisions than a feature like that. And I'm not coming up with anything.
    - Re: (Score:3)
      
      by Hognoxious ( 631665 ) writes:
      
      Obviously you're trapped in a relational mindset.
      That makes two of us.
- Re:Typical applications? (Score:5, Funny)
  
  by jrumney ( 197329 ) writes: on Sunday January 16, 2011 @09:39PM (#34901064)
  
  What sorta applications need so many columns?
  Facebook needs one column for every privacy violation.
  
- Re: (Score:2)
  
  by Whiternoise ( 1408981 ) writes:
  
  I can only think of something where you might want to input something ridiculously large like an image (or similar matrix of information with millions of points) so you could perform statistical analysis on a per-pixel basis. The pixel example would be for an image, but if you wanted to store something like, say, some parameter at a grid point and you wanted to compare those parameters between a load of different grids. It seems a very laborious way of doing things, but maybe if each point is storing a lo
  - Re: (Score:3)
    
    by NNKK ( 218503 ) writes:
    
    Cassandra doesn't have "tables", and Cassandra's rows and columns have nothing to do with the rows and columns you're used to in SQL databases. Until you understand this, you will continue to be confused.
    The "name" of a column is an arbitrary key -- you could have a row with a bunch of columns named things like "Country", or "Username", but you could also have columns named "jsmith", "jdoe", "12345", "USA", "Canada", etc., and you don't have to pre-define the column names.
    - Re: (Score:3)
      
      by Mr Z ( 6791 ) writes:
      
      In perl-speak, a Cassandra table sounds suspiciously like a nested hash if Cassandra's rows and columns are unsorted, or an array of array of key-value pairs if they are sorted.
      And if I understood the brief description of the use model from the article someone else linked, it sounds like you make a new table (columnfamily?) for each of the different criteria you might query against. The index for that table would be the parameterized bits of that query, and the other columns represent all the data that wou
      - Re:Typical applications? (Score:4, Informative)
        
        by Sarten-X ( 1102295 ) writes: on Monday January 17, 2011 @02:56AM (#34902336) Homepage
        
        Close. It's more of a hash table of a sorted hash table... Columns are unsorted, but rows are (I think... I've only used HBase personally).
        If you know what you'll be looking for ahead of time, you can make your life easy with a write-heavy system. What's missing in standard Cassandra is a way to run ad-hoc queries. My understanding is that Cassandra can now run with Hadoop's MapReduce framework. Any query or computation can be run against the Cassandra table in a widely-distributed fashion as a MapReduce job. It's not as fast as an SQL query on an indexed column, but far better than a query on an unindexed one, because everything runs in parallel across the cluster.
        
- Re: (Score:2)
  
  by g4b ( 956118 ) writes:
  
  my first applicable use would be to have one row saving a domain in the first column (or other fixed data in a fixed number of additional columns) and change the db dimensions on the fly by adding multiple columns serially saving information like access time and IP address.
  that would mean i can search by row to get the domain and log accesses easy
  i would just try that and look if this is speedier than having two tables saving by row.
  also comes in handy to add a column for each new user and a row for each new t
- Nobody read "Jurassic Park"? (Score:2)
  
  by Ken Hall ( 40554 ) writes:
  
  As I recall, one of the tasks given to Nedry in the design of the computer systems was to devise a database capable of holding a couple of billion fields to handle the sequencing of DNA strands.
  - Re: (Score:2)
    
    by DavidTC ( 10147 ) writes:
    
    That is possibly the stupidest design imaginable. You wouldn't be storing each DNA sequence in a field. DNA is full of variable length stuff, so that one slight insertion or deletion or change between species would result in the rest of the fields being offset, which would a) be hell to actually update, and b) entirely pointless because you can't compare them or search for them as fields.
    I'm not entirely sure what you would be doing, but it wouldn't be that. I'm not entirely sure what the Jurassic Park peo
- Re: (Score:3, Informative)
  
  by red_blue_yellow ( 1353825 ) writes:
  
  Columns in Cassandra aren't analogous to columns in an RDBMS. Every row is basically a list of (key, value) pairs. This is referred to as a column, with the key being the column name. There's no requirement that rows have the same set of column names.
  Typically large rows are used for indexes or timelines. In a timeline example, you might use a timestamp for every column name and store the entry as the column value. Cassandra keeps the row sorted by column name, so all of the entries in the row (timelin
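
The timeline pattern mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TimelineRow` is a hypothetical class, integer timestamps stand in for column names, and the standard `bisect` module stands in for whatever ordering the database maintains internally.

```python
import bisect

class TimelineRow:
    """Hypothetical wide row: columns kept sorted by name (a timestamp here)."""

    def __init__(self):
        self._names = []   # sorted column names (integer timestamps)
        self._values = {}  # column name -> column value

    def insert(self, timestamp, entry):
        """Add or overwrite the column named by `timestamp`."""
        if timestamp not in self._values:
            bisect.insort(self._names, timestamp)
        self._values[timestamp] = entry

    def slice(self, start, end):
        """Columns with start <= name <= end, in name order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, end)
        return [(t, self._values[t]) for t in self._names[lo:hi]]

    def latest(self, count):
        """The newest `count` columns, newest first."""
        if count <= 0:
            return []
        return [(t, self._values[t]) for t in reversed(self._names[-count:])]

timeline = TimelineRow()
# Writes arrive out of order; the row stays sorted by column name.
timeline.insert(1295230000, "third post")
timeline.insert(1295220000, "first post")
timeline.insert(1295225000, "second post")

print(timeline.slice(1295220000, 1295226000))
print(timeline.latest(2))
```

Because the columns are already ordered, "the last N entries" and "everything between two times" are both cheap range reads rather than sorts.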
- Re: (Score:2)
  
  by AHuxley ( 892839 ) writes:
  
  312000000 by ~50 states/extraterritorial jurisdiction/territories with a code from each Fusion center. Add in extra space for other 3 letter agencies, faith/militia/gang/vet details, no fly list flagged ... gets big fast.
- Re: (Score:2)
  
  by WuphonsReach ( 684551 ) writes:
  
  What sorta applications need so many columns? Curious.
  
  Sample collection data where you are collecting a few hundred individual loosely related (and often completely unrelated other than the sample number) attributes per sample. For the most part, due to a lot of databases having a 255 column limit, this means you have to have multiple data tables. Which may or may not be a problem depending on how you need to report the data.
- Re: (Score:3)
  
  by ultranova ( 717540 ) writes:
  
  What sorta applications need so many columns? Curious.
  
  Judging by the name, a pretty incredible one.
- - - Re: (Score:2, Funny)
      
      by adonoman ( 624929 ) writes:
      
      No no, one column for each resident, plus a column for the row header. Each row holds one item of information: Name, address, etc...
      That way, adding a new data point to keep track of is a simple as inserting a new row.
      - Re: (Score:3)
        
        by Sarten-X ( 1102295 ) writes:
        
        I don't know if that was sarcastic or not, but given that Cassandra is column-oriented, that's pretty much right (not so much with the header, but metadata is likely). Use a column family for each region, and you can process statistics in small chunks without a ridiculously-overpowered server. Only the requested column families need to be loaded into memory for processing.
        
        Re: (Score:3, Interesting)
        
        by DavidTC ( 10147 ) writes:
        
        Wow, it's almost like you've invented databases, but rotated 90 degrees so that every single existing programming paradigm fails and you have to invent new ones to loop through columns.
        Instead of what every other database does, load the rows you want, and just those rows. With nicely named headers that get used to label the parts of each row. Oh, and types that vary per column.
        And indexes on columns...wait, let me guess, you can now index rows...although that can't actually work, programmatically, because t
        
        Re:Typical applications? (Score:5, Informative)
        
        by Sarten-X ( 1102295 ) writes: on Monday January 17, 2011 @02:16AM (#34902214) Homepage
        
        Welcome to the first five minutes of using a column store. Screwy, ain't it?
        My understanding is that rows' contents are indexed such that they may be retrieved quickly. Think of a row name as a primary key. It's easy to get the whole row when you know its name. Continuing the census application, it'd be like asking for all the birth years of everyone in a geographical region. The requested column family (geographical region) is opened, and each column (person) is quickly checked for the particular row's contents (in case the birth year wasn't provided). Partitioning is done by both row and column family, so only some of the column family's data is actually scanned. That's where the cluster provides a very nice speedup, as well.
        locating a value in a specific row can't tell how to retrieve that entire column
        Now, I'm not sure if I understand your rage-induced rambling correctly, but if you're trying to make a SQL example, you're starting from the wrong premise, which explains why you're having trouble making sense of it all.
        Quick review: The "R" in "RDBMS" stands for "relational", referring to an n-ary relation. SQL is intended to manipulate those relations, isolating the data you want to extract. Something that is not described as an RDBMS should not be expected to have relations.
        Cassandra functions (from the application perspective) as a key-value store, with no relation structure. That means you don't work with sets, and you don't need to think about set operations. Pull out a row, and you get a list of columns with defined values, as well as those values. Iterate through each value looking for whatever value you're looking for. When you find it, you already have the column name. Just ask for the whole column next. Since the whole thing is running in a cluster, you can parallelize the iterations (I think... I've used HBase, but not Cassandra personally) to speed up the scan.
        If that's not fast enough for you (which is likely), you can use Hadoop's MapReduce framework to scan each cell and create an index, possibly laid over the other table as just more rows & columns (though a different table would be better, from a sanity perspective). Since there's no mandatory structure, that's legit.
        Of course, that's only valid for this particular census application, which assumes that the only reason for the database is either basic statistics or something complex enough for a MapReduce program.
        It's entirely possible to run Cassandra arranged similar to a normal RDBMS. Use only a few column families with very specific columns (such as a single family for all the "Name, address, etc."). Throw in a bunch of index families, updated with MapReduce. Then, your processing can be a complex MapReduce job, iterating over each row with a particular set of rows meeting all your needed criteria. It'd be just like a normal RDBMS, except you have better scalability, and maintain indexes yourself.
        If the trouble of indexing is too much for you, you can follow Google's route with Colossus, which runs MapReduce-like tasks when rows are changed. That's your dynamic indexing.
        Here's some links to help your understanding:
        
        Looking to the future with Cassandra [digg.com]
        Understanding the Cassandra Data Model from a SQL Perspective [insidesystems.net]
        WTF is a SuperColumn? An Intro to the Cassandra Data Model [arin.me] (While reading this, I note some discrepancy of terms I've used due to my familiarity with HBase. Please excuse that.)
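
The scan-then-index approach described in this comment can be sketched as follows. This is a single-process Python illustration, not Hadoop or Cassandra code: the `table` layout (one row per attribute, one column per person) follows the census example in the thread, and `scan_row`, `get_column`, `map_phase`, and `reduce_phase` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical census layout from the comment: one row per attribute,
# one column per person. Plain dicts stand in for the store.
table = {
    "name":       {"p1": "Ann", "p2": "Bob", "p3": "Cy"},
    "birth_year": {"p1": 1970, "p2": 1985, "p3": 1970},
}

def scan_row(table, row_key, wanted):
    """Brute-force scan: names of columns in one row holding `wanted`."""
    return sorted(col for col, val in table[row_key].items() if val == wanted)

def get_column(table, column):
    """Fetch one whole column: its value in every row that defines it."""
    return {row: cols[column] for row, cols in table.items() if column in cols}

def map_phase(row_key, columns):
    """Emit ((row, value), column) pairs, as a mapper would per row."""
    for column, value in columns.items():
        yield (row_key, value), column

def reduce_phase(pairs):
    """Group emitted pairs into an index: (row, value) -> sorted columns."""
    index = defaultdict(list)
    for key, column in pairs:
        index[key].append(column)
    return {key: sorted(cols) for key, cols in index.items()}

pairs = [p for row_key, cols in table.items() for p in map_phase(row_key, cols)]
index = reduce_phase(pairs)

print(scan_row(table, "birth_year", 1970))   # scans every column in the row
print(index[("birth_year", 1970)])           # same answer from the index
print(get_column(table, "p3"))
```

In a real cluster the map step would run in parallel over partitions of the data; here it is an ordinary loop, which is enough to show why the index has to be maintained by the application rather than by the store.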
        
        
        Re: (Score:3)
        
        by butlerm ( 3112 ) writes:
        
        Welcome to the first five minutes of using a column store.
        Calling Cassandra a "column store" or "column oriented database" is an abuse of the language. Real column oriented databases store "columns" of data in a linear sequential manner, so that they can be scanned in the fastest manner possible.
        Cassandra isn't like that - it stores denormalized rows with repeating groups in a free form manner, not "columns" at all. If it were a real column oriented database it would be completely unusable for most online
    - Re:"would be stored in the rows..." (Score:2)
      
      by Joce640k ( 829181 ) writes:
      
      You've not done much outsourcing, have you?
- - Re: (Score:2)
    
    by TooMuchToDo ( 882796 ) writes:
    
    Nope. We use flat files for storing collider data from the LHC.
    - Re: (Score:2)
      
      by The_Wilschon ( 782534 ) writes:
      
      I don't know if I would describe ROOT's abominable TFiles as "flat files". I'm not even sure I would describe them as "files"... Certainly TTrees are extremely like an object database, except poorly designed, described, and implemented.
  - Re: (Score:2)
    
    by Daniel Dvorkin ( 106857 ) * writes:
    
    Speaking as a bioinformatician who does a lot of DB work (the only one in the lab who has professional DBA experience ...), I'll be the first to say that I can't see myself storing data this way. I'd be willing to be convinced, but as it stands, I don't see any use for this. IMO, YMMV, etc.
- - - Re: (Score:3)
      
      by WuphonsReach ( 684551 ) writes:
      
      I've been in the business for more than two decades, and I have never ever encountered a situation where I need 256(!) columns. True, I have worked mostly in tech/business sectors, and that's why I asked the question: what sorta application need so many columns.
      
      Data collection where you are reporting across samples (averages, means, group by) but where you are collecting dozens or hundreds of generally unrelated attributes for each sample. Some attributes might be related, but only loosely, other attrib
Only 2 billion? (Score:2)

by Jeremi ( 14640 ) writes:

They should have gone with the uint32_t counter, then they could support up to 4 billion!
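
The arithmetic behind the joke, sketched in Python: a signed 32-bit counter tops out just above 2 billion, an unsigned one just above 4 billion. That Cassandra's limit comes from a signed 32-bit count is an assumption here; the story summary does not say so.

```python
import struct

INT32_MAX = 2**31 - 1    # largest signed 32-bit value
UINT32_MAX = 2**32 - 1   # largest unsigned 32-bit value

print(INT32_MAX)   # 2147483647
print(UINT32_MAX)  # 4294967295

# struct enforces the ranges: "<i" is signed 32-bit, "<I" is unsigned.
struct.pack("<i", INT32_MAX)    # fits
struct.pack("<I", UINT32_MAX)   # fits

overflowed = False
try:
    struct.pack("<i", INT32_MAX + 1)   # one past the signed limit
except struct.error:
    overflowed = True
print(overflowed)
```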
- Re:Only 2 billion? (Score:5, Funny)
  
  by Anonymous Coward writes: on Sunday January 16, 2011 @09:19PM (#34900966)
  
  You work for Gillette, don't you.
  
  - Re:Only 2 billion? (Score:5, Funny)
    
    by zach_the_lizard ( 1317619 ) writes: on Sunday January 16, 2011 @11:39PM (#34901620)
    
    He doesn't, otherwise it'd be uint64_t and a lather strip!
    
Bah, this is silly. (Score:2)

by intellitech ( 1912116 ) * writes:

If this really matters at all, besides being slightly cool, it will just lead to more bad db design.
- - Re: (Score:3)
    
    by Sarten-X ( 1102295 ) writes:
    
    on the fly
    Like storing the contents of a web crawl. The row key is the URL, the column is the crawl timestamp, and the cell contains the page (or keywords). That's a column created on the fly. Another application off the top of my head is storing access logs, where each row is a date, each column is a person, and each cell contains a resource they accessed. Having two billion columns is hardly excessive (in theory) for a suitably-large application.
    Cassandra, like BigTable and HBase, is not the same as a traditional
Why? (Score:4, Insightful)

by Xoc-S ( 645831 ) writes: on Sunday January 16, 2011 @09:08PM (#34900870)

Only a completely de-normalized flat-file database would need anything like that number of columns. That would mean many duplicate pieces of information, and a complete maintenance nightmare. The only purpose I can see is to have views of existing normalized data for fast searching, but that would be read-only data.
This is a feature in need of an application and I can see very few applications.

- Re:Why? (Score:4, Funny)
  
  by Jeremi ( 14640 ) writes: on Sunday January 16, 2011 @09:15PM (#34900934) Homepage
  
  This is a feature in need of an application and I can see very few applications.
  I think you're right, but as long as we're adding features for the sake of having features... why limit the table to two dimensions? Perhaps the next version of Cassandra can support 3D-data-cubes, with each cell specified via a (row,column,level) triplet. And the version after that will allow hypercubes of data with any number of dimensions (up to 2 billion dimensions maximum, of course).
  
  - Re: (Score:2)
    
    by Sarten-X ( 1102295 ) writes:
    
    Disclaimer: I haven't used Cassandra personally, but I have used HBase which operates similarly.
    Cassandra uses column families, which are groups of columns, and are individually selectable. If all families contain the same columns, you have 3D (family, column, row) storage! Now, with HBase, excessive column family creation and maintenance isn't the ideal route, but if you actually need 3D storage, it would work pretty decently.
    Cassandra, BigTable, and HBase are designed for applications that need lots of ra
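
The "3D (family, column, row)" addressing described above might look like this in Python. It is a sketch under assumptions: the nested dictionaries, the `put`, `get`, and `load_family` helpers, and the `level-N` family names are all hypothetical, chosen only to show three-part addressing and per-family access.

```python
from collections import defaultdict

# Hypothetical store, not the Cassandra or HBase API:
# family -> row key -> {column name: value}
store = defaultdict(lambda: defaultdict(dict))

def put(family, row, column, value):
    """Write one cell addressed by (family, row, column)."""
    store[family][row][column] = value

def get(family, row, column, default=None):
    """Read one cell; .get() at each level avoids creating empty entries."""
    return store.get(family, {}).get(row, {}).get(column, default)

def load_family(family):
    """Copy out a single family, leaving the others untouched."""
    return {row: dict(cols) for row, cols in store.get(family, {}).items()}

# Each family acts as one "level" of the cube and reuses the same column names.
put("level-0", "r1", "c1", "a")
put("level-0", "r1", "c2", "b")
put("level-1", "r1", "c1", "c")

print(get("level-1", "r1", "c1"))
print(load_family("level-0"))
```

Reading one family without touching the others is the property the comment relies on; whether a large number of families is practical in a given store is a separate question that this sketch does not address.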
    - Re: (Score:3)
      
      by Daniel Dvorkin ( 106857 ) * writes:
      
      As an example, let's consider a forum. One row per thread, one column per post.
      Um, okay, but why would you set your database up that way in the first place? I really don't see the advantage of this over a more standard table having columns for, say, forum ID, thread ID, poster ID, timestamp, and content.
      - Re: (Score:3)
        
        by Dynedain ( 141758 ) writes:
        
        In the example you just made, I can see that the benefit is that you don't need another layer (PHP, stored queries, etc) to stitch the thread back together. The data structure inherently "knows" how the thread of posts is assembled.
      - Re: (Score:3)
        
        by Sarten-X ( 1102295 ) writes:
        
        Why not, if you expect to have several billion posts?
        The more important issue in this architecture decision would be scaling needs and abilities. How many billion rows can a typical RDBMS handle on a $20,000 budget? If that budget goes to $40,000, will that capacity double? With a column-oriented database, only the needed column families are loaded into memory. For this forum example, you could have a family for each month of operation. Old threads would then be entirely in old column families, so they woul
        
        Re: (Score:2)
        
        by c6gunner ( 950153 ) writes:
        
        For this forum example, you could have a family for each month of operation.
        So what's the difference between that and using a typical SQL database and just having a new table for each month of operation? Aren't tables loaded into memory as necessary?
  - Re: (Score:3)
    
    by maraist ( 68387 ) * writes:
    
    There are many problem-sets where you might like to perform associative mapping. If the keys and/or values are large, you can easily hit the 2GB limit on a single primary key. Imagine if you felt that Cassandra could help you in CPU node mappings, or weather patterns. The associations can be in the billions, and while you may or may not have a primary key for each main node, the association list may approach N. In traditional RDBMS, such large association mappings M:N tables, are impractical to travers
  - Re: (Score:2)
    
    by melted ( 227442 ) writes:
    
    You've just described BigTable. :-)
- Re: (Score:2)
  
  by NNKK ( 218503 ) writes:
  
  Only a completely de-normalized flat-file database would need anything like that number of columns. That would mean many duplicate pieces of information, and a complete maintenance nightmare. The only purpose I can see is to have views of existing normalized data for fast searching, but that would be read-only data.
  This is a feature in need of an application and I can see very few applications.
  Um, a very common answer to Cassandra questions is "denormalize". This is not an RDBMS, stop treating it like one.
- Re: (Score:2)
  
  by maraist ( 68387 ) * writes:
  
  Have you reviewed the BigTable architecture? The central idea is to store what would normally be normalized joined data instead as in-line column-families. Within a column-family, you have related columns that are effectively your name-value pairs. Each name in the name-value pair is called a column (which in RDBMS it would more likely be a table with 3 columns, foreign-key, name, value - but with the tremendous inefficiency of having to do the join). All this effectively means is that prior to this ve
2 billion columns... (Score:5, Funny)

by aBaldrich ( 1692238 ) writes: on Sunday January 16, 2011 @09:12PM (#34900904)

ought to be enough for everybody

- Re: (Score:2)
  
  by adamofgreyskull ( 640712 ) writes:
  
  Joke away but, going by some of the shit I've seen at TheDailyWTF [thedailywtf.com], that could well come back and bite you in the ass one day.
  - Re: (Score:2)
    
    by WGFCrafty ( 1062506 ) writes:
    
    Woooooooooooooooosh. WOOOSH wooosh woosh. Four woosh's ought to be enough for you.
This is a triumph for hideously bad schema (Score:5, Informative)

by Sarusa ( 104047 ) writes: on Sunday January 16, 2011 @09:16PM (#34900940)

Well good on them for solving an interesting technical problem, but the use cases for this are all bad.
Obvious first use: boss will suggest we optimize the database by using only one gigantic row with two billion columns.

- Re: (Score:2)
  
  by teknopurge ( 199509 ) writes:
  
  Database? Psha - we only use Excel for our most critical data storage needs....
- Re: (Score:3)
  
  by Jugalator ( 259273 ) writes:
  
  This is a triumph for hideously bad schema
  This isn't a relational database. There is no schema. [/matrix]
Thank goodness! (Score:2)

by wonkavader ( 605434 ) writes:

Now I can finally shoe-horn my coworkers' Excel spreadsheets into a database.
for those that absolutely positively cannot RTFA (Score:5, Informative)

by Son of Byrne ( 1458629 ) writes: on Sunday January 16, 2011 @09:28PM (#34901012) Journal

Cassandra appears to be a multi-dimensional datastore that does not store data in the same fashion as a typical RDBMS. It uses columns and rows both to store sets of data uniquely. If you're familiar with Big Table, then, apparently, it's kinda like that.
That just means that they've added even more storage vectors to it than before...not sure why it made slashdot front page...

- Re: (Score:2)
  
  by Dahamma ( 304068 ) writes:
  
  Not knocking Cassandra, but basically it means that this metric of "2 billion columns", being completely different from the concept of RDBMS columns, really doesn't mean much from a comparative point of view...
  It's kinda like saying "that army of ants will conquer all nations, they have 2 billion soldiers!" :)
- - Re: (Score:2)
    
    by maraist ( 68387 ) * writes:
    
    I wonder if it's possible to represent a non-cartesian basis vector-space with a DB. Maybe one of the columns is sinusoidally looped - haha, every 32nd insert wraps around itself.. Oh this could be a cool MLK holiday project.
Cassandra (Score:5, Funny)

by tverbeek ( 457094 ) writes: on Sunday January 16, 2011 @09:49PM (#34901132) Homepage

I predict that bad things will come of this.
Not that anyone will believe me.

- Re: (Score:2)
  
  by thewils ( 463314 ) writes:
  
  I believe you :) There's a subset of coders who don't see anything wrong with "Select *" all over the place and I have a feeling this construct might chew up available memory real quick if a table has anywhere near this number of columns...
- Re: (Score:2)
  
  by WeatherGod ( 1726770 ) writes:
  
  Nah, that never will happen! (For those who didn't get it: http://en.wikipedia.org/wiki/Cassandra [wikipedia.org])
figured it out (Score:2)

by Bizzeh ( 851225 ) writes:

I know why the developers thought this would be a good idea. A feature this mental would be sure to get them free publicity on slashdot
- Re: (Score:3)
  
  by mini me ( 132455 ) writes:
  
  A column in Cassandra is sort of, if you have to make a comparison, like a join in SQL. Using Slashdot as an example, the topic would be the row, and each comment within that topic would be a column. Wanting to store more than 2GB of column data doesn't seem mental at all.
  Whether or not it is worthy of the front page is another question.
  - Re: (Score:2)
    
    by butlerm ( 3112 ) writes:
    
    Non-relational databases that do this have been around for decades. Adabas and Pick are the examples that come to mind. The pertinent difference here is that the developers of those databases were sane enough not to call repeating groups "columns".
    - Re: (Score:2)
      
      by mini me ( 132455 ) writes:
      
      Comparing it to a join was, of course, an oversimplification. Cassandra is not relational, so it is difficult to directly compare features with a relational database.
      Cassandra utilizes ideas from column-oriented databases [wikipedia.org] to store its data, so it is not wrong to call them columns. They are just not columns in the relational database sense.
      - Re: (Score:3)
        
        by butlerm ( 3112 ) writes:
        
        They are just not columns in the relational database sense.
        They are not columns _even_ in the sense that column oriented databases use. They are repeating groups. What column oriented databases call "columns" have a perfect logical correspondence with what relational databases call columns. Nothing about the relational model dictates either row or column orientation, so far as storage is concerned.
        The logical and physical structure of a Cassandra row has been used in some databases (Adabas, Pick, etc) fo
2 billion columns? (Score:2)

by flimflammer ( 956759 ) writes:

This sounds purely like marketing gibberish when you can't create enough meaningful features to boast about.
I can't even think of a reason why you would need 2 billion columns. If you did, I think the ability to store it is the least of your problems.
- Re: (Score:2)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
Indexes (Score:4, Informative)

by Twillerror ( 536681 ) writes: on Sunday January 16, 2011 @10:32PM (#34901328) Homepage Journal

Cassandra, like many of the "no sql" type databases, doesn't have classic indexes.
So instead of having an index you typically have a separate table that acts as the index.
Imagine you have a users table. One of the fields is country. Now you want to know all the users for a particular country.
In standard RDBMS type systems you just scan each row or have an index that has done that "ahead of time" or as rows are inserted.
In Cassandra the rows of users are distributed possibly among 100s of servers. So scanning for all users that have a particular country would require scanning all rows, which could take a long time.
Unlike in RDBMS-like systems, rows don't have a 2D structure and don't have a real limitation on the number of columns they can have. And columns can essentially be arrays/rows of objects.
So as you design/bang out your application you typically realize you need to know "users by country" for some stupid report. So you create a new table to hold these values. This has one row per country. As users are entered you append to this row. This essentially creates an array-like structure. You then look up the row for a particular country and you now know all the users for that particular country.
Sounds like Cassandra is getting rid of a limitation that could have caused a very large index to require multiple rows.
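The "users by country" pattern described above can be sketched in a few lines of Python. This is a minimal in-memory sketch under stated assumptions: plain dicts stand in for the two tables, and the names `users`, `users_by_country`, `add_user`, and `users_in` are hypothetical, not calls from any real Cassandra client.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for two tables (not a real client API).
users = {}                            # row key: user id -> {column: value}
users_by_country = defaultdict(dict)  # row key: country -> {user id: ""}


def add_user(user_id, name, country):
    """Write the user row and append a column to the matching index row."""
    users[user_id] = {"name": name, "country": country}
    # The column name carries the data; the value is left empty.
    users_by_country[country][user_id] = ""


def users_in(country):
    """One row lookup instead of a scan over every user row."""
    return sorted(users_by_country.get(country, {}))


add_user("u1", "Alice", "SE")
add_user("u2", "Bob", "US")
add_user("u3", "Carol", "SE")
print(users_in("SE"))  # ['u1', 'u3']
```

The trade-off is that the application has to keep both writes in step itself, and a country with a very large number of users produces a very wide index row, which is presumably the case the 0.7 change is meant to help with.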

- Yes and the funniest thing about all this is (Score:5, Insightful)
  
  by Giant Electronic Bra ( 1229876 ) writes: on Sunday January 16, 2011 @11:35PM (#34901606)
  
  That we had all of this stuff 30 years ago. It was called 'network' databases, which were pretty much the standard sort of technology before RDBMS came along and everyone realized how incredibly much better relational algebra was for the vast majority of problems. As with many other things older ideas eventually resurface with new names and a few more features. There are times when this kind of facility is useful. Nothing wrong with it. The vast majority of cases though where I've seen people using something like Cassandra or Big Table were ill advised. A properly optimized RDBMS with correctly designed schema can handle all but a few edge cases. Most of the hype these tools are generating is based on a lack of real understanding of how to properly use databases combined with people believing myths about other technologies and helped along by the industry's short memory span. The best part though is that when something turns into a giant mess guys like me can make nice money fixing the mess. lol.
  
  - Re: (Score:3)
    
    by DavidTC ( 10147 ) writes:
    
    The vast majority of cases though where I've seen people using something like Cassandra or Big Table were ill advised. A properly optimized RDBMS with correctly designed schema can handle all but a few edge cases. Most of the hype these tools are generating is based on a lack of real understanding of how to properly use databases combined with people believing myths about other technologies and helped along by the industry's short memory span.
    Indeed, and there are edge cases, like Facebook, or Google, or
    - Re: (Score:3, Informative)
      
      by red_blue_yellow ( 1353825 ) writes:
      
      Indeed, and there are edge cases, like Facebook, or Google, or whatever. The edge cases are gigantic databases that are accessed in certain specific way.
      It's true that many people attempt to prematurely optimize by using Cassandra first instead of something they are already familiar with. However, when faced with some of the pains of growing an RDBMS beyond what a single box can handle, it's worth it to consider your other options. Keep in mind that if it's easy to store and make use of a huge pile of data, you're more tempted to gather that data in the first place, where 10 years ago it might have been prohibitively expensive or difficult.
      There are probably fewer edge cases than actual NoSQL codebases, which is pretty surreal. There are more actual products than the number of people who need the products. And 99.99% of the people playing with them don't need them at all.
      I can assure yo
      - Re: (Score:3)
        
        by Giant Electronic Bra ( 1229876 ) writes:
        
        My comment would just be along the lines of what DavidTC stated though, in the case where that kind of technology is warranted you're either in a huge organization with very specialized needs or well beyond the competency level of small shops. It isn't so much a factor of being able to find a tool that could do the job. It is a matter that the various factors going into that kind of scale of system are so complex and varied. You need expertise in large scale mass storage, clustering, management, etc to
Introduction to Cassandra (Score:2)

by Fnord666 ( 889225 ) writes:

Here [maxgrinev.com] is a link to an introduction to the Cassandra database system. One thing to realize is that Cassandra is one of the new "noSQL" DBMS. These operate very differently than an RDBMS such as Oracle or DB2.
Paging Microsoft (Score:2)

by Nom du Keyboard ( 633989 ) writes:

Now if only Excel would follow.
SELECT * FROM TWO_BILLION_COLUMN_TABLE; (Score:2)

by timeOday ( 582209 ) writes:

Man this is great! Now I only need one table and never have to JOIN again. Most of the rows won't use most of the columns but that's what NULL is for, am I right?
Finally! (Score:3)

by Compaqt ( 1758360 ) writes: on Sunday January 16, 2011 @11:50PM (#34901660) Homepage

I was having trouble making a table for my new Web 3.0 m-commerce application on lesser databases:
CREATE TABLE peeps(
peep1_first_name VARCHAR(255),
peep1_last_name VARCHAR(255),
peep1_address VARCHAR(255),
peep1_address2 VARCHAR(255),
peep1_address3 VARCHAR(255),
peep1_creditcard VARCHAR(255),
peep1_creditcard2 VARCHAR(255),
peep1_creditcard3 VARCHAR(255),
peep2_first_name VARCHAR(255),
peep2_last_name VARCHAR(255),
peep2_address VARCHAR(255),
peep2_address2 VARCHAR(255),
peep2_address3 VARCHAR(255),
peep2_creditcard VARCHAR(255),
peep2_creditcard2 VARCHAR(255),
peep2_creditcard3 VARCHAR(255), ...
509 Bandwidth Limit Exceeded

And Oracle supports EXABYTE sized databases (Score:4, Interesting)

by dirkdodgers ( 1642627 ) writes: on Monday January 17, 2011 @12:14AM (#34901752)

So I can appreciate that this announcement sounds like News for Nerds, but can someone explain why it Matters that Cassandra can support 2 billion columns?
The article basically says "because you can't execute SQL you need lots of columns". OK, great, why would I want that? The article doesn't tell me. The Cassandra website sure doesn't tell me.
Oracle 11 supports up to 8 fucking EXABYTES of data in an RDBMS that I can execute SQL against. What Cassandra puts in columns, I put in rows.
I've scoured this thread like all the other ones on Cassandra for the killer feature, for the "you can do this with Cassandra that you can't do as well with an RDBMS" and I can't find it.
The best I can come up with is "I want to store lots of indexed data, I don't care about transactional integrity, and I don't want to pay Oracle". Is that it? That's fine if it's it, Oracle doesn't come cheap and that can be a deal breaker for new companies, but I just wish someone would spell out that this is the justification for Cassandra's existence.

- Re: (Score:2)
  
  by melted ( 227442 ) writes:
  
  The killer feature is that it is actually horizontally scalable and fault-tolerant out of the box.
  - Re: (Score:2)
    
    by teknopurge ( 199509 ) writes:
    
    The killer feature is that it is actually horizontally scalable and fault-tolerant out of the box.
    So is Postgres. Like the OP, I'm still waiting for a good reason to use NoSQL-type storage. I have to agree that these are all solutions looking for problems: trying to re-invent the wheel for no other reason than they don't know how to correctly do it with the existing products.
    - Re: (Score:3)
      
      by melted ( 227442 ) writes:
      
      Try to deploy Postgres on a 5000 machine cluster, with replication and failover and then get back to me. And by "failover" here I mean the entire racks or even network segments going away with nary a hiccup in serving, no manual intervention (except for bringing up replacement nodes), and no data loss.
      Then there's the issue of RDBMSs being suboptimal for straightforward user profile storage. You have to implement a lot of things by hand. Cassandra (or BigTable) gives you a versioned, fault tolerant, scalable
      - Re: (Score:3)
        
        by melted ( 227442 ) writes:
        
        You still haven't described how you'd implement sharding (at which point with most realistic relational schemas you're likely to lose the ability to do joins), load balancing and transparent, reliable failover. SANs don't free you from the need to do DB replication, since several processes can't write to the same set of files without massive synchronization overhead, so you're in a losing position right from the start, SANs notwithstanding.
        But even assuming you got all of this to work, you're still doing it
- Re:And Oracle supports EXABYTE sized databases (Score:5, Interesting)
  
  by DavidTC ( 10147 ) writes: <slas45dxsvadiv.v ... m ['box' in gap]> on Monday January 17, 2011 @02:01AM (#34902176) Homepage
  
  NoSQL stuff is useful in weird extreme fringe cases, where you need to access data in essentially random ways. Digg, Facebook, and Google all use NoSQL databases, and I think the first two use Cassandra.
  Specifically, you kinda make your own rows. It's like having permanent multiple JOINs that you can access instantly, from what I understand. (This is what this article is talking about, it's now unlimited.)
  Essentially, it's a giant blob of data that exists, and you draw lines on it in advance that are your results, and you can get those result instantly, at the cost of being unable to decide to get other results in real time.
  Many of the products let you have them on different servers, so you can have a 'people who have voted for this Digg' table or something, on the server that handles that thing.
  I'm not entirely sure how it works, but that's basically it. Oh, and the fact they talk about 'columns' and 'rows' is just utter stupidity in naming to confuse everyone. Basically, they simply tend to keep each column as a file, which allows them to do what I mentioned above..copy needed columns, and just needed columns, to other servers.
  It's really weird, and, like I said, only relevant for giant giant databases. There's no way that google could do a full text search on a RDBMS, regardless if it fits in Oracle. What it can do is make a 'column' for each word, and a 'row' for each URL, put different columns on different servers, and that actually works in the non-relational database they use, when there's no way in hell that would work on a RDBMS.
  However, more importantly for slashdot, a fuckload of fools think that SQL is somehow 'retarded' and that NoSQL is 'awesome, dude', so they like to play with it, usually by spewing out some crap PHP or Perl or something that works about a tenth as well as just using an RDBMS would work. If they actually understood how to use an RDBMS, that is.
  
Wonderful (Score:2)

by dynamo ( 6127 ) writes:

This is great for those of us in the database community who are purists about only using one row of data.
- Re:If you have more than 30 columns (Score:5, Informative)
  
  by ogrisel ( 1168023 ) writes: on Sunday January 16, 2011 @09:35PM (#34901044)
  
  Not with column store databases such as Cassandra, HBase and BigTable.
  
  - Re: (Score:2)
    
    by butlerm ( 3112 ) writes:
    
    Cassandra, HBase and BigTable aren't traditionally what is meant by the term column store database [wikipedia.org] at all. Much closer to hybrid "repeating group" databases like Adabas [wikipedia.org] and Pick [wikipedia.org].
    True column store databases are almost unheard of for online transaction processing because they are optimized for streaming, unindexed data storage and subsequent column oriented analysis over large datasets with very low per row overhead. A bitmap index is the closest
- Re: (Score:3)
  
  by mini me ( 132455 ) writes:
  
  If you are writing SQL, maybe. Cassandra is not a relational database.
- - Re: (Score:2)
    
    by Mitchell314 ( 1576581 ) writes:
    
    Two words: normalization.
  - Re: (Score:2)
    
    by DavidTC ( 10147 ) writes:
    
    I call bullshit.
    No one answers 'tens of thousand' question questionnaires. At 5 seconds a question, that's 28 hours for 20,000 questions. (And let's not even hypothesize how long the damn results would take to read.)
    Alternately, you think you need more than one field a question, which means you are doing it wrong.
    I don't know what you mean 'careful redesign into a relational structure' either. A sane design might be to remove the person's info to another table, if people answer more than one questionnaire
- Re: (Score:3)
  
  by Sarten-X ( 1102295 ) writes:
  
  Cassandra doesn't use SQL, and isn't even like a RDBMS in any way other than "it stores a table of data", so the SQL statement would be nonexistent.
