Cassandra Basics

Introduction

Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Originally developed at Facebook to power their Inbox Search feature, Cassandra was released as an open-source project in 2008 and later became a top-level Apache project.

Cassandra is particularly well-suited for applications that cannot afford to lose data, require high performance, and need to scale horizontally with ease. Unlike traditional relational databases, Cassandra uses a ring architecture and offers a flexible schema that can adapt to changing application requirements.

Key Features of Cassandra

Distributed: Data is automatically partitioned across the nodes of a cluster
Decentralized: No single point of failure; every node plays the same role and there is no master
Scalable: Capacity grows roughly linearly as nodes are added
Fault-Tolerant: Each partition is replicated to several nodes for redundancy
Tunable Consistency: The consistency level is chosen per operation, trading consistency against availability and latency
High Performance: The write path is optimized for high write throughput
Wide-column store: Stores data in tables (formerly called column families) as rows grouped into partitions; it is not a columnar analytics store

Cassandra Architecture

Cassandra employs a masterless "ring" architecture that distributes data across multiple nodes. Because any node can accept reads and writes, the loss of a single node does not make the cluster unavailable.

Node Structure

Each node in a Cassandra cluster contains:

Partitioner: Determines how data is distributed across nodes
Snitch: Defines network topology and helps route requests efficiently
Gossip Protocol: Communication method for nodes to share their state
Memtables: In-memory structures that buffer recent writes until they are flushed to disk as SSTables
SSTables: Immutable, persistent data files on disk
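
The relationship between memtables and SSTables can be sketched in a few lines of Python. This is a toy model, not Cassandra's storage engine: the class name `ToyNodeStorage` and the `memtable_limit` parameter are made up for illustration, and the commit log, compaction, and timestamp-based reconciliation are left out.

```python
class ToyNodeStorage:
    """Toy write path: writes land in a memtable and are flushed to immutable SSTables."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold (entry count)
        self.memtable = {}                    # mutable, in memory
        self.sstables = []                    # immutable sorted tuples, oldest first

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into a sorted, immutable table and start a new one."""
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        """Check the memtable first, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for stored_key, value in sstable:
                if stored_key == key:
                    return value
        return None


store = ToyNodeStorage(memtable_limit=2)
store.write("p1", "Laptop")
store.write("p2", "Coffee Maker")   # second write fills the memtable and triggers a flush
store.write("p1", "Laptop XPS 15")  # newer value lives in the fresh memtable
print(store.read("p1"), len(store.sstables))  # Laptop XPS 15 1
```

The point of the sketch is that flushed data is never modified in place: an update simply shadows the older value until the files are merged later.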

Data Distribution

Cassandra distributes data using consistent hashing. When data is written:

The partitioner hashes the partition key to produce a token
The token determines which node owns the data, since each node is responsible for a range of tokens
Data is replicated to additional nodes based on the replication factor
A write is considered successful once the number of replicas required by the consistency level has acknowledged it
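
The steps above can be sketched as a toy token ring. This is a simplified illustration under stated assumptions: MD5 stands in for Cassandra's default Murmur3 partitioner, each node gets a single token derived from its name, and replicas are chosen by walking the ring clockwise without regard to racks or datacenters. The names `token_for` and `Ring` are hypothetical.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Hash a key to a position on the ring (MD5 is a stand-in for Murmur3)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Toy token ring: one token per node, replicas found by walking clockwise."""

    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed node count")
        self.replication_factor = replication_factor
        # Sort nodes by token so the list is in ring order.
        self._ring = sorted((token_for(node), node) for node in nodes)
        self._tokens = [token for token, _ in self._ring]

    def replicas_for(self, partition_key: str):
        """Return the nodes that store the partition, primary replica first."""
        # First node whose token is >= the key's token; wrap around at the end.
        start = bisect.bisect_left(self._tokens, token_for(partition_key))
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replication_factor)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"], replication_factor=3)
owners = ring.replicas_for("550e8400-e29b-41d4-a716-446655440000")
print(owners)  # three distinct nodes, the same ones on every call
```

Because placement depends only on the hash of the partition key, any node can work out where a partition lives without consulting a central directory.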

Cassandra Data Model

Cassandra organizes data into:

Keyspace: Similar to a database in RDBMS
Table/Column Family: Similar to a table in RDBMS
Row: A collection of columns identified by a primary key
Column: A name-value pair with a timestamp
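
The timestamp on each column is what lets replicas agree on a value: when two versions of the same column exist, the one with the later write timestamp wins. A minimal sketch of that rule, with hypothetical names (`Cell`, `reconcile`) and made-up timestamps, ignoring how ties and deletions are handled:

```python
from typing import NamedTuple


class Cell(NamedTuple):
    name: str
    value: object
    timestamp: int  # write time in microseconds since the epoch


def reconcile(*versions: Cell) -> Cell:
    """Pick the version with the highest write timestamp (last write wins)."""
    return max(versions, key=lambda cell: cell.timestamp)


older = Cell("price", "1299.99", timestamp=1_700_000_000_000_000)
newer = Cell("price", "1199.99", timestamp=1_700_000_500_000_000)
winner = reconcile(older, newer)
print(winner.value)  # 1199.99
```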

Data Modeling Concepts

When designing a Cassandra data model, remember:

Query-driven design: Model your tables based on the queries you'll run
Denormalization: Duplicate data to avoid joins (Cassandra doesn't support joins)
Wide partitions: Group related rows in a single partition and order them with clustering columns
Primary keys: Consist of partition key (determines data distribution) and clustering columns (determines data ordering)
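
To make the two roles of the primary key concrete, here is a toy in-memory model: the partition key selects a group of rows, and the clustering key keeps the rows in that group sorted so that range queries are a contiguous slice. The class `ToyTable` and the sample keys are hypothetical and only illustrate the idea.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Rows grouped by partition key and kept sorted by clustering key."""

    def __init__(self):
        # partition key -> list of (clustering key, row), sorted by clustering key
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, row):
        rows = self._partitions[partition_key]
        keys = [ck for ck, _ in rows]
        index = bisect.bisect_left(keys, clustering_key)
        if index < len(rows) and rows[index][0] == clustering_key:
            rows[index] = (clustering_key, row)         # same primary key: overwrite
        else:
            rows.insert(index, (clustering_key, row))   # keep clustering order

    def range_query(self, partition_key, start, end):
        """Return rows of one partition with start <= clustering key < end."""
        rows = self._partitions.get(partition_key, [])
        keys = [ck for ck, _ in rows]
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_left(keys, end)
        return [row for _, row in rows[lo:hi]]


table = ToyTable()
table.insert("product-1", 3, "third review")
table.insert("product-1", 1, "first review")
table.insert("product-1", 2, "second review")
table.insert("product-2", 1, "other product")
print(table.range_query("product-1", 1, 3))  # ['first review', 'second review']
```

Note that the query touches a single partition and reads a contiguous run of rows, which is the access pattern the design guidelines above aim for.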

Basic Cassandra Query Language (CQL)

Cassandra Query Language (CQL) provides a SQL-like interface for interacting with Cassandra. Let's look at some basic CQL operations:

Create a Keyspace

sql
CREATE KEYSPACE ecommerce 
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

Create a Table

sql
USE ecommerce;

CREATE TABLE products (
  product_id UUID PRIMARY KEY,
  name TEXT,
  category TEXT,
  price DECIMAL,
  inventory INT,
  description TEXT
);

Insert Data

sql
INSERT INTO products (product_id, name, category, price, inventory, description)
VALUES (uuid(), 'Laptop XPS 15', 'Electronics', 1299.99, 50, 'High-performance laptop');

INSERT INTO products (product_id, name, category, price, inventory, description)
VALUES (uuid(), 'Coffee Maker', 'Kitchen', 89.99, 100, 'Programmable coffee maker');

Query Data

sql
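-- Get all products (scans the whole table; avoid on large datasets)
SELECT * FROM products;

-- Get a specific product (substitute a product_id returned by the query above)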
SELECT * FROM products WHERE product_id = 550e8400-e29b-41d4-a716-446655440000;

-- Get products by category (requires a secondary index)
CREATE INDEX ON products(category);
SELECT * FROM products WHERE category = 'Electronics';

Update Data

sql
UPDATE products 
SET price = 1199.99, inventory = 45 
WHERE product_id = 550e8400-e29b-41d4-a716-446655440000;

Delete Data

sql
DELETE FROM products 
WHERE product_id = 550e8400-e29b-41d4-a716-446655440000;

Consistency Levels

One of Cassandra's key features is tunable consistency. You can specify the consistency level for each operation, trading stronger consistency against availability and latency.

Some common consistency levels:

ONE: The read or write must be acknowledged by at least one replica
QUORUM: A majority of all replicas (replication factor / 2, rounded down, plus 1) must respond
ALL: All replicas must respond
LOCAL_QUORUM: A majority of replicas in the local datacenter must respond
EACH_QUORUM: A majority of replicas in each datacenter must respond
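
The arithmetic behind these levels is small enough to check by hand. The sketch below computes the quorum size and applies the usual rule of thumb that a read is guaranteed to see the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. The function names and the two-datacenter layout are illustrative assumptions, not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
print(quorum(rf))                                          # 2
print(reads_see_latest_write(quorum(rf), quorum(rf), rf))  # True: QUORUM reads + QUORUM writes
print(reads_see_latest_write(1, 1, rf))                    # False: ONE reads + ONE writes

# Hypothetical two-datacenter layout: number of replicas per datacenter.
replicas_per_dc = {"DC1": 3, "DC2": 2}
total_replicas = sum(replicas_per_dc.values())
print(quorum(total_replicas))                              # 3: QUORUM counts replicas in all datacenters
local_quorums = {dc: quorum(n) for dc, n in replicas_per_dc.items()}
print(local_quorums)                                       # {'DC1': 2, 'DC2': 2}
```

With a replication factor of 3, QUORUM tolerates one unavailable replica for both reads and writes, which is why it is a common default for data that must be read back consistently.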

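Example of setting the consistency level in cqlsh (CONSISTENCY is a cqlsh shell command rather than a CQL statement; client drivers set the level on the statement or session instead):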

sql
CONSISTENCY QUORUM;
SELECT * FROM products WHERE product_id = 550e8400-e29b-41d4-a716-446655440000;

Real-World Use Case: Product Catalog

Let's design a simple product catalog for an e-commerce platform. We need to:

Store product information
Allow queries by product ID
Allow queries by category
Track inventory
Maintain product reviews

Schema Design

sql
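-- Create keyspace. If it already exists from the earlier example, use ALTER KEYSPACE
-- with the same replication map instead. 'DC1' and 'DC2' must match your cluster's datacenter names.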
CREATE KEYSPACE ecommerce 
WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2};

USE ecommerce;

-- Products table (adds timestamps to the earlier example; drop the earlier table first if it exists)
CREATE TABLE products (
  product_id UUID PRIMARY KEY,
  name TEXT,
  category TEXT,
  price DECIMAL,
  inventory INT,
  description TEXT,
  created_at TIMESTAMP,
  last_updated TIMESTAMP
);

-- Secondary index for category queries (convenient, but a query may have to contact many nodes)
CREATE INDEX ON products(category);

-- Products by category: a denormalized table that serves the same query from a single partition
CREATE TABLE products_by_category (
  category TEXT,
  product_id UUID,
  name TEXT,
  price DECIMAL,
  inventory INT,
  PRIMARY KEY (category, product_id)
);

-- Product reviews
CREATE TABLE product_reviews (
  product_id UUID,
  review_id TIMEUUID,
  customer_id UUID,
  rating INT,
  review_text TEXT,
  review_date TIMESTAMP,
  PRIMARY KEY (product_id, review_id)
) WITH CLUSTERING ORDER BY (review_id DESC);

Sample Data Operations

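Insert a product into both the products table and the category table, so that the category query below returns it:

sql
-- A literal UUID is used because each uuid() call would generate a different value.
-- The logged batch keeps the two denormalized tables in step.
BEGIN BATCH
  INSERT INTO products (
    product_id, name, category, price, inventory, description, created_at, last_updated
  ) VALUES (
    550e8400-e29b-41d4-a716-446655440000, 'Smartphone X', 'Electronics', 699.99, 200,
    '5G smartphone with 6.7" display and 128GB storage',
    toTimestamp(now()), toTimestamp(now())
  );
  INSERT INTO products_by_category (category, product_id, name, price, inventory)
  VALUES ('Electronics', 550e8400-e29b-41d4-a716-446655440000, 'Smartphone X', 699.99, 200);
APPLY BATCH;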

Query products by category:

sql
SELECT * FROM products_by_category WHERE category = 'Electronics';

Add a product review:

sql
INSERT INTO product_reviews (
  product_id, review_id, customer_id, rating, review_text, review_date
) VALUES (
  550e8400-e29b-41d4-a716-446655440000, now(), 
  123e4567-e89b-12d3-a456-426614174000, 
  5, 'Great product, very happy with my purchase!', toTimestamp(now())
);

Get all reviews for a product:

sql
SELECT * FROM product_reviews 
WHERE product_id = 550e8400-e29b-41d4-a716-446655440000;

Performance Considerations

When working with Cassandra, keep these best practices in mind:

Minimize the number of partitions read for a single query
Keep related data in the same partition when possible
Use clustering columns for range queries
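Avoid storing large amounts of data in collection types (lists, sets, maps); they are intended for small values
Use the ALLOW FILTERING clause sparingly, as it can force Cassandra to scan far more data than the query returns
Set an appropriate TTL (Time to Live) for data that should expire
Use batch statements carefully: they keep related writes together but do not speed up bulk loading, and batches spanning many partitions put extra load on the coordinator node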

Cassandra vs. Traditional RDBMS

Feature	Cassandra	RDBMS
Data Model	Wide-column (partitioned rows)/NoSQL	Relational
Schema	Flexible	Rigid
Scaling	Horizontal (add nodes)	Primarily vertical (upgrade server)
Consistency	Tunable	ACID
Joins	Not supported	Supported
Transactions	Limited (lightweight transactions, batches)	Fully supported
Best for	Write-heavy, distributed applications	Complex queries, transactions

Setting Up Cassandra Locally

To get started with Cassandra on your local machine:

Install a Java version supported by your Cassandra release (for example, Java 8 or 11 for Cassandra 4.x)
Download Cassandra from the Apache Cassandra website
Extract the downloaded file

Set environment variables:

bash
export CASSANDRA_HOME=/path/to/cassandra
export PATH=$PATH:$CASSANDRA_HOME/bin

Start Cassandra in the foreground:

```bash
cassandra -f
```

Connect using the CQL shell from a second terminal:

```bash
cqlsh
```

Summary

Apache Cassandra is a powerful, distributed NoSQL database designed for high availability and scalability. Its key strengths include:

Masterless architecture with no single point of failure
Linear scalability by simply adding more nodes
Tunable consistency levels to balance between consistency and availability
High write throughput for data-intensive applications
Flexible schema that can evolve with your application needs

Cassandra excels in use cases like time-series data, product catalogs, recommendation engines, and other scenarios requiring high write throughput and horizontal scalability. While it may not be the best choice for applications requiring complex transactions or joins, its robust architecture makes it ideal for large-scale, distributed applications where uptime and fault tolerance are critical.

Additional Resources

Apache Cassandra Documentation
DataStax Academy - Free Cassandra courses
Cassandra: The Definitive Guide by Eben Hewitt & Jeff Carpenter

Exercises

Set up a local Cassandra instance and create a keyspace for a social media application.
Design a data model for storing user profiles, posts, and comments.
Write CQL queries to:
- Create the necessary tables
- Insert sample data
- Retrieve a user's posts
- Retrieve comments for a specific post
Experiment with different consistency levels and observe the behavior.
Implement a simple application that connects to your Cassandra instance and performs CRUD operations.
