It seems my TB of storage is nothing compared to what these guys have.

Just skim through the numbers and you will shit bricks.

How big is big data really? From time to time, various organizations brag about how much data they have, how big their clusters are, how many requests per second they serve, etc. Every time I come across these statistics, I make note of them. It's quite amazing to see how these numbers change over time... looking at the numbers from just a few years ago reminds you of this famous Austin Powers scene. Here's another gem.

Without further ado, here's "big" data, in reverse chronological order...

Pinterest (July 2014)

Some stats:

30 billion Pins in the system
20 terabytes of new data each day
around 10 petabytes of data in S3
migrated Hadoop jobs to Qubole
over 100 regular MapReduce users running over 2,000 jobs each day through Qubole's web interface (ad-hoc jobs and scheduled workflows)
six standing Hadoop clusters comprised of over 3,000 nodes
over 20 billion log messages and process nearly a petabyte of data with Hadoop each day

Source: Pinterest Engineering Blog

Baidu (July 2014)

#sigir2014 #sirip Baidu processes 7 billion searches per day.
— Mark Sanderson (@IR_oldie) July 7, 2014

HDFS at Twitter (June 2014)

Max int in javascript 2^53=8 PB isn't enough to show some individual users' HDFS usage on our #Hadoop clusters. Overflow is so 20th century!
— Joep R. (@joep) June 27, 2014
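
For anyone wondering where the 8 PB in that tweet comes from: JavaScript numbers are 64-bit floats, so integers are only exact up to 2^53, and 2^53 bytes works out to 8 pebibytes in binary units. Here's a rough Python sketch of the arithmetic, using Python's float as a stand-in for a JavaScript number. The helper names are made up for illustration and are not from the tweet.

```python
# Assumption: Python's float is the same IEEE 754 double that JavaScript uses
# for its Number type, so it serves as a stand-in here.
MAX_EXACT = 2 ** 53   # integers up to this value are exactly representable
PIB = 2 ** 50         # one pebibyte (binary petabyte) in bytes


def pebibytes(num_bytes):
    """Convert a byte count to pebibytes."""
    return num_bytes / PIB


def survives_float64(num_bytes):
    """True if the byte count round-trips through a 64-bit float unchanged."""
    return int(float(num_bytes)) == num_bytes


limit_in_pib = pebibytes(MAX_EXACT)              # 8.0
ok_at_limit = survives_float64(MAX_EXACT)        # True
ok_past_limit = survives_float64(MAX_EXACT + 1)  # False: rounds back to 2**53
```

So a per-user usage counter one byte past 8 PiB already gets silently rounded, which is presumably the overflow the tweet is joking about.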

ACL Lifetime Achievement Award (June 2014)

A perspective on progress of computing from Bob Mercer's ACL lifetime award talk. #acl2014
— Delip Rao (@deliprao) June 25, 2014

Airbnb (June 2014)

15 million total guests have stayed on Airbnb. It took us nearly 4 years to get our 1st million, and now we have 1M guests every month.
— Brian Chesky (@bchesky) June 11, 2014

DataTorrent at the Hadoop Summit (June 2014)

[David] Hornik [of August Capital] was also an early stage investor in Splunk, and he sees lots of potential here for DataTorrent. “When you can process a billion data points in a second, there are a lot of possibilities.”

Source: TechCrunch

What would you do with a system that could process 1.5 billion events per second? That’s the mind-boggling rate at which DataTorrent’s Real-Time Streaming (RTS) offering for Hadoop was recently benchmarked. Now that RTS is generally available–DataTorrent announced its general availability today at Hadoop Summit in San Jose–we may soon find out.

That 1.5-billion-events-per-second figure was recorded on DataTorrent’s internal Hadoop cluster, which sports 34 nodes. Each node is able to process tens of thousands of incoming events (call data records, machine data, and clickstream data are common targets) per second, and in turn generates hundreds of thousands of secondary events that are then processed again using one of the 400 operators that DataTorrent makes available as part of its in-memory, big-data kit.

Source: Datanami

Yahoo! and the Hadoop Summit (June 2014)

#hadoopsummit @yahoo - benchmark for interactive analytics: 60B events, 3.5TB data compressed, response time of <400ms. WOW! @QlikView
— bob hardaway (@bobhardaway) June 4, 2014

Wondering how big Hadoop clusters get? Yahoo is approaching 500 PB #hadoopsummit #bigdatap
— richardwinter (@richardwinter) June 3, 2014

[local cached copy]

oh thoseYyahoo stats, like 365 PB storage, 330k YARN nodes, ~4- 6 YARN jobs per second #HadoopSummit #bigdata term sounds almost quaint
— Tony Baer (@TonyBaer) June 26, 2013

#Hadoop at Yahoo - 365+ PB of HDFS storage, 30,000 nodes, 400,000 jobs per day, 10 million hours per day #hadoopsummit
— Jeff Kelly (@jeffreyfkelly) June 26, 2013

Snapchat (May 2014)

Snapchat claims that over 700 million snaps are shared per day on the service, which could make it the most-used photo-sharing app in the world — ahead of Facebook, WhatsApp, and others. Even Snapchat’s Stories feature seems to be doing well, amassing 500 million views per day.

Source: The Verge

Kafka at LinkedIn (May 2014)

Source: Samza at LinkedIn: Taking Stream Processing to the Next Level by Martin Kleppmann at Berlin Buzzwords on May 27, 2014

What "Big Data" Means to Google (May 2014)

David Glazer showing what #bigdata means @google #bigdatamed
— Andrew Su (@andrewsu) May 22, 2014

Tape Storage (May 2014)

Tale of the tape: 400+ Exabytes of data are stored on tape! #tapeworldrecord
— IBM Research (@IBMResearch) May 20, 2014

Here's a local copy of the infographic.

Here's more bragging from IBM about achieving a new record of 85.9 billion bits of data per square inch in areal data density on low-cost linear magnetic particulate tape.

Google's Bigtable at HBaseCon2014 (May 2014)

Bigtable scale numbers from keynote talk at HBaseCon2014 by Carter Page:

Correction: BigTable at Google serves 2+ exabytes at 600M QPS organization wide. That's a scale quite challenging to conceptualize. Wow.
— Andrew Purtell (@akpurtell) May 6, 2014

More HBaseCon2014 highlights.

Hadoop at eBay (April 2014)

EBay's scaling of their Hadoop clusters is impressive.
— Owen O'Malley (@owen_omalley) April 3, 2014

Hive at Facebook (April 2014)

Our warehouse stores upwards of 300 PB of Hive data, with an incoming daily rate of about 600 TB.

Source: Facebook Engineering Blog

Kafka at LinkedIn (April 2014)

Kafka metrics @ LinkedIN 300 Brokers, 18000 topics, 220 Billions messages per day... impressive! #apachecon
— Ronan GUILLAMET (@_Spiff_) April 7, 2014

Internet Archive (February 2014)

Wayback Machine updated, now 397,144,266,000 web objects in it, (html, jpg, css). Getting close to 400Billion. @internetarchive
— Brewster Kahle (@brewster_kahle) February 28, 2014

Google's BigTable (September 2013)

Jeff Dean at XLDB: Largest Google Bigtable cluster: ~100s PB data; sustained: 30M ops/sec; 100+ GB/s I/O (gmail?)
— Jimmy Lin (@lintool) September 17, 2013

Source: Jeff Dean's talk slides at XLDB 2013 [local copy]

Google's Disks (2013)

Estimate: Google has close to 10 exabytes of active storage attached to running clusters.

Source: What if?

For reference: Total disk storage systems capacity shipped (in 2013) reached 8.2 exabytes.

Source: IDC Press Release

NSA's datacenter (Summer 2013)

Blueprints Of NSA's Ridiculously Expensive Data Center In Utah Suggest It Holds Less Info Than Thought
The NSA Is Building the Country's Biggest Spy Center (Watch What You Say)
Capacity of the Utah Data Center

Amazon S3 (April 2013)

There are now more than 2 trillion objects stored in Amazon S3, and the service is regularly peaking at over 1.1 million requests per second.

Source: Amazon Web Services Blog

Hadoop at Yahoo! (February 2013)

~45k Hadoop nodes, ~350 PB total

Source: YDN Blog

Internet Archive reaches 10 PB (October 2012)

Blog post about the Internet Archive's 10 PB party.

10,000,000,000,000,000 Bytes Archived! @internetarchive go open content movement
— Brewster Kahle (@brewster_kahle) October 26, 2012

#WaybackMachine updated with 240 billion pages! Go @internetarchive ! 5PB be #BigData
— Brewster Kahle (@brewster_kahle) January 10, 2013

Source: Internet Archive

Google at SES San Francisco (August 2012)

Google has seen more than 30 trillion URLs and crawls 20 billion pages a day. One hundred billion searches are conducted each month on Google (3 billion a day).

Source: Spotlight Keynote With Matt Cutts #SESSF (from Google)

Cassandra at eBay (August 2012)

eBay Marketplaces:

97 million active buyers and sellers
200+ million items
2 billion page views each day
80 billion database calls each day
5+ petabytes of site storage capacity
80+ petabytes of analytics storage capacity

A glimpse on our Cassandra deployment:

Dozens of nodes across multiple clusters
200 TB+ storage provisioned
400M+ writes & 100M+ reads per day, and growing
QA, LnP, and multiple Production clusters

Source: Slideshare [local copy]

The size, scale, and numbers of eBay (June 2012)

We have over 10 petabytes of data stored in our Hadoop and Teradata clusters. Hadoop is primarily used by engineers who use data to build products, and Teradata is primarily used by our finance team to understand our business
We have over 300 million items for sale, and over a billion accessible at any time (including, for example, items that are no longer for sale but that are used by customers for price research)
We process around 250 million user queries per day (which become many billions of queries behind the scenes – query rewriting implies many calls to search to provide results for a single user query, and many other parts of our system use search for various reasons)
We serve over 2 billion pages to customers every day
We have over 100 million active users
We sold over US$68 billion in merchandize in 2011
We make over 75 billion database calls each day (our database tables are denormalized because doing relational joins at our scale is often too slow – and so we precompute and store the results, leading to many more queries that take much less time each)

Source: Hugh Williams Blog Post

Facebook at Hadoop Summit (June 2012)

HDFS growth at #facebook. 100 PB in some clusters. 200 million files #hadoopsummit
— chiradeep (@chiradeep) June 13, 2012

Amazon S3 Cloud Storage Hosts 1 Trillion Objects (June 2012)

Late last week the number of objects stored in Amazon S3 reached one trillion.

Source: Amazon Web Services Blog

Pinterest Architecture Update (May 2012)

18 Million Visitors, 10x Growth, 12 Employees, 410 TB Of Data.

80 million objects stored in S3 with 410 terabytes of user data, 10x what they had in August. EC2 instances have grown by 3x. Around $39K for S3 and $30K for EC2.

Source: High Scalability Blog Post

Six Super-Scale Hadoop Deployments (April 2012)

Source: Datanami

Ranking at eBay (April 2012)

eBay is amazingly dynamic. Around 10% of the 300+ million items for sale end each day (sell or end unsold), and a new 10% is listed. A large fraction of items have updates: they get bids, prices change, sellers revise descriptions, buyers watch, buyers offer, buyers ask questions, and so on. We process tens of millions of change events on items in a typical day, that is, our search engine receives that many signals that something important has changed about an item that should be used in the search ranking process. And all that is happening while we process around 250 million queries on a typical day.

Source: Hugh Williams Blog Post

Modern HTTP Servers Are Fast (March 2012)

A modern HTTP server (nginx 1.0.14) running on somewhat recent hardware (dual Intel Xeon X5670, 6 cores at 2.93 GHz, with 24GB of RAM) is capable of servicing 500,000 Requests/Sec.

Source: The Low Latency Web

Tumblr Architecture (February 2012)

15 Billion Page Views A Month And Harder To Scale Than Twitter: 500 million page views a day, a peak rate of ~40k requests per second, ~3TB of new data to store a day, all running on 1000+ servers.

Source: High Scalability Blog Post

Digital Universe (2011)

In 2011 the world will create a staggering 1.8 zettabytes.

Source: IDC

DataSift Architecture (November 2011)

936 CPU Cores
Current Peak Delivery of 120,000 Tweets Per Second (260Mbit bandwidth)
Performs 250+ million sentiment analysis with sub 100ms latency
1TB of augmented (includes gender, sentiment, etc) data transits the platform daily
Data Filtering Nodes Can process up to 10,000 unique streams (with peaks of 8000+ tweets running through them per second)
Can do data-lookup's on 10,000,000+ username lists in real-time
Links Augmentation Performs 27 million link resolves + lookups plus 15+ million full web page aggregations per day.

Source: High Scalability Blog Post

Hadoop at Facebook (July 2011)

In 2010, Facebook had the largest Hadoop cluster in the world, with over 20 PB of storage. By March 2011, the cluster had grown to 30 PB.

Source: Facebook Engineering Blog

Hive at Facebook (May 2009)

Facebook has 400 terabytes of disk managed by Hadoop/Hive, with a slightly better than 6:1 overall compression ratio. So the 2 1/2 petabytes figure for user data is reasonable.
Facebook’s Hadoop/Hive system ingests 15 terabytes of new data per day now, not 10.
Hadoop/Hive cycle times aren’t as fast as I thought I heard from Jeff. Ad targeting queries are the most frequent, and they’re run hourly. Dashboards are repopulated daily.

In a new-to-me metric, Facebook has 610 Hadoop nodes, running in a single cluster, due to be increased to 1000 soon.

Source: DBMS2
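
The "2 1/2 petabytes is reasonable" remark is easy to sanity-check. A minimal sketch, assuming decimal units (1 PB = 1000 TB), that the 400 TB is the compressed on-disk size, and ignoring HDFS replication. The function names are hypothetical.

```python
TB_PER_PB = 1000  # assumption: decimal units


def logical_pb(disk_tb, compression_ratio):
    """Uncompressed data size in PB for a given compressed disk footprint."""
    return disk_tb * compression_ratio / TB_PER_PB


def ratio_needed(disk_tb, user_data_pb):
    """Compression ratio required to fit user_data_pb into disk_tb."""
    return user_data_pb * TB_PER_PB / disk_tb


at_six_to_one = logical_pb(400, 6)       # 2.4 PB
for_2_5_pb = ratio_needed(400, 2.5)      # 6.25, i.e. "slightly better than 6:1"
```

At exactly 6:1 you get 2.4 PB, and 2.5 PB needs 6.25:1, so the quoted figures hang together.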

Data Warehouses at eBay (April 2009)

Metrics on eBay's main Teradata data warehouse include:

>2 petabytes of user data
10s of 1000s of users
Millions of queries per day
72 nodes
>140 GB/sec of I/O, or 2 GB/node/sec, or maybe that's a peak when the workload is scan-heavy
100s of production databases being fed in

Metrics on eBay's Greenplum data warehouse (or, if you like, data mart) include:

6 1/2 petabytes of user data
17 trillion records
150 billion new records/day, which seems to suggest an ingest rate well over 50 terabytes/day
96 nodes
200 MB/node/sec of I/O (that's the order of magnitude difference that triggered my post on disk drives)
4.5 petabytes of storage
70% compression
A small number of concurrent users

Source: DBMS2
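
Two of the derived figures in those lists can be checked with a few divisions. This is only a sketch under stated assumptions: decimal units, the quoted totals taken at face value, and helper names invented for illustration.

```python
TB = 10 ** 12  # assumption: decimal terabyte, in bytes


def per_node(total, nodes):
    """Spread a cluster-wide total evenly across nodes."""
    return total / nodes


def bytes_per_record(bytes_per_day, records_per_day):
    """Average record size implied by a daily ingest volume."""
    return bytes_per_day / records_per_day


# Teradata: 140 GB/sec across 72 nodes
teradata_gb_per_node = per_node(140, 72)                    # about 1.94

# Greenplum: 150 billion records/day at 50 TB/day
greenplum_record_size = bytes_per_record(50 * TB, 150 * 10 ** 9)  # about 333 bytes
```

So "2 GB/node/sec" is the 140 GB/sec figure rounded up, and "well over 50 terabytes/day" implies an average record of more than roughly 333 bytes.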

The World's Technological Capacity (2007)

In 2007, humankind was able to store 295 exabytes.

Source: Science Magazine

All the empty or usable space on hard drives, tapes, CDs, DVDs, and memory (volatile and nonvolatile) in the market equaled 264 exabytes.

Source: IDC

Visa Credit Card Transactions (2007)

According to the Visa website, they processed 27.612 billion transactions in 2007. This means an average of 875 credit transactions per second based on a uniform distribution. Assuming that 80% of transactions occur in 8 hours of the day, this gives an event rate of 2100 transactions per second.

Source: Schultz-Møller et al. (2009)
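
The Visa numbers are simple enough to reproduce. A quick sketch, assuming a 365-day year and the 80%-in-8-hours split stated in the quote. The function names are made up for illustration.

```python
SECONDS_PER_HOUR = 3600
DAYS_PER_YEAR = 365  # assumption: non-leap year

TRANSACTIONS_2007 = 27_612_000_000


def average_rate(total, days=DAYS_PER_YEAR):
    """Transactions per second if spread uniformly over the whole year."""
    return total / (days * 24 * SECONDS_PER_HOUR)


def busy_rate(total, share, hours_per_day, days=DAYS_PER_YEAR):
    """Transactions per second when `share` of the volume falls in
    `hours_per_day` hours of each day."""
    return share * total / (days * hours_per_day * SECONDS_PER_HOUR)


uniform = average_rate(TRANSACTIONS_2007)      # about 875.6 per second
busy = busy_rate(TRANSACTIONS_2007, 0.8, 8)    # about 2101.4 per second
```

Both results match the quoted 875 and 2100 once rounded down.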

My data is bigger than your data!

Taken from: