ORC Tools

Work with Optimized Row Columnar format for big data processing

ORC File Viewer

View ORC file schema, metadata, and statistics

About ORC Format

Optimized Row Columnar (ORC) is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to other columnar formats such as Parquet and RCFile, and was originally designed to speed up Apache Hive workloads on Hadoop.

Key Features:

  • Type-aware: Knows the data type of each column for better encoding
  • Self-describing: Includes metadata about the data structure
  • Compression: Built-in support for ZLIB, Snappy, LZO, LZ4, and Zstandard
  • Indexes: Lightweight indexes including min/max values and bloom filters
  • Streaming: Large files can be read without loading the entire file into memory
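
To illustrate how min/max indexes let a reader skip data, here is a minimal sketch in plain Python. It does not use any ORC library or read real ORC files; the names RowGroupStats, build_index, and scan_equal are hypothetical, and the group size is shrunk to 4 rows so the example stays readable.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Tiny group size for illustration only; this is an assumption of the sketch,
# not an ORC setting.
ROWS_PER_GROUP = 4


@dataclass
class RowGroupStats:
    start: int    # index of the first row in the group
    minimum: int  # smallest value in the group
    maximum: int  # largest value in the group


def build_index(values: List[int], rows_per_group: int = ROWS_PER_GROUP) -> List[RowGroupStats]:
    """Record min/max statistics for each consecutive group of rows."""
    index = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        index.append(RowGroupStats(start, min(chunk), max(chunk)))
    return index


def scan_equal(values: List[int], index: List[RowGroupStats], target: int,
               rows_per_group: int = ROWS_PER_GROUP) -> Tuple[List[int], int]:
    """Return (matching row numbers, number of row groups actually read)."""
    matches: List[int] = []
    groups_read = 0
    for stats in index:
        # The target cannot be in this group, so skip it without reading rows.
        if target < stats.minimum or target > stats.maximum:
            continue
        groups_read += 1
        chunk = values[stats.start:stats.start + rows_per_group]
        for offset, value in enumerate(chunk):
            if value == target:
                matches.append(stats.start + offset)
    return matches, groups_read


values = [1, 3, 2, 4, 10, 12, 11, 13, 20, 22, 21, 23]
index = build_index(values)
rows, groups_read = scan_equal(values, index, 11)
print(rows, groups_read)
```

Only the group whose min/max range contains the target is scanned; the other two are skipped on the strength of their statistics alone. Bloom filters serve a similar purpose for equality lookups when min/max ranges overlap too much to be selective.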

File Structure:

  • Stripes: Large, independently readable units of data (default 64 MB) for efficient reads
  • Row Groups: Sets of rows within a stripe (10,000 rows by default) that the indexes refer to
  • Streams: Column data stored in separate streams
  • File Footer: Contains the schema, stripe locations, row counts, and file-level column statistics
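
Because the metadata sits at the end of the file, a reader starts from the tail. The sketch below, in plain Python, checks two facts about the layout: an ORC file begins with the 3-byte magic "ORC", and its final byte holds the length of the postscript, a small record that tells the reader how to find the footer. The function name describe_orc_tail is hypothetical, the buffer is synthetic rather than a valid ORC file, and the postscript itself (which is Protocol Buffers encoded) is not decoded here.

```python
ORC_MAGIC = b"ORC"


def describe_orc_tail(data: bytes) -> dict:
    """Locate the postscript in an ORC-style byte buffer without decoding it."""
    if len(data) < len(ORC_MAGIC) + 1:
        raise ValueError("too short to be an ORC file")
    if data[:len(ORC_MAGIC)] != ORC_MAGIC:
        raise ValueError("missing ORC magic header")

    # The last byte of the file stores the postscript length in bytes.
    postscript_length = data[-1]
    # The postscript sits immediately before that final length byte.
    postscript_start = len(data) - 1 - postscript_length
    if postscript_start < len(ORC_MAGIC):
        raise ValueError("postscript length exceeds file size")

    return {
        "postscript_length": postscript_length,
        "postscript_start": postscript_start,
    }


# Synthetic buffer: magic, 20 filler bytes, a fake 5-byte postscript, length byte.
fake_file = ORC_MAGIC + b"\x00" * 20 + b"P" * 5 + bytes([5])
layout = describe_orc_tail(fake_file)
print(layout)
```

A real reader would go on to decode the postscript to learn the footer length and compression codec, then read the footer to find each stripe.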

When to Use ORC:

  • Hive data warehouses requiring ACID transactions
  • Analytics workloads with selective column reads
  • ETL pipelines needing high compression
  • Spark applications processing large datasets
  • Long-term data archival with query capability