In this repository, we use three seminars to illustrate how CASD handles large-scale geospatial data.
Many tools and frameworks exist for working with geospatial data. In this seminar, we only cover PostGIS and Sedona.
Apache Sedona (formerly known as GeoSpark) is a cluster computing system for processing large-scale spatial data.
Sedona extends existing cluster computing systems, such as Apache Spark, with a set of out-of-the-box
distributed Spatial Datasets and Spatial SQL that efficiently load, process, and analyze large-scale spatial data across machines.
GeoParquet is an incubating Open Geospatial Consortium (OGC) standard that adds interoperable geospatial types (e.g. Point, Line, Polygon) to Parquet. For more details, you can visit their official website.
Compared to traditional solutions, Sedona provides distributed spatial datasets (e.g. RDD, DataFrame) and
distributed spatial queries (e.g. range query, range join query, distance join query, K Nearest Neighbor query). This
feature allows Sedona to handle large-scale geospatial data (up to petabyte scale).
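
To make the query types named above concrete, here is a minimal single-machine sketch in plain Python. It is not Sedona's API: the sample points and the function names `range_query`, `knn_query` and `distance_join` are hypothetical, and the brute-force loops only illustrate what each query returns. Sedona's contribution is to run equivalent queries on partitioned data across a cluster.

```python
import math

# Hypothetical sample data: (name, x, y) points in an arbitrary planar CRS.
POINTS = [
    ("a", 1.0, 1.0),
    ("b", 2.0, 3.0),
    ("c", 5.0, 5.0),
    ("d", 8.0, 1.0),
]


def range_query(points, min_x, min_y, max_x, max_y):
    """Return the points inside the query rectangle (borders included)."""
    return [p for p in points if min_x <= p[1] <= max_x and min_y <= p[2] <= max_y]


def knn_query(points, qx, qy, k):
    """Return the k points closest to (qx, qy) by Euclidean distance."""
    return sorted(points, key=lambda p: math.hypot(p[1] - qx, p[2] - qy))[:k]


def distance_join(left, right, max_dist):
    """Pair each left point with every right point at most max_dist away."""
    return [
        (l[0], r[0])
        for l in left
        for r in right
        if math.hypot(l[1] - r[1], l[2] - r[2]) <= max_dist
    ]


if __name__ == "__main__":
    print(range_query(POINTS, 0, 0, 3, 3))
    print(knn_query(POINTS, 6, 4, 2))
    print(distance_join(POINTS[:2], POINTS[2:], 4))
```

The distance join compares every pair of points, which is exactly the cost that spatial partitioning and indexing in a distributed engine are meant to avoid.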
Sedona provides predefined spatial objects:
- point
- line
- polygon
- multi-line
- multi-polygon
Sedona supports various input formats:
- GeoParquet
- CSV/TSV
- WKT/WKB
- GeoJSON
- GML1,GML2
- KML
- Shapefile
- Geohash
- GeoTIFF
- NetCDF/HDF
- GeoPackage
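
Several of the formats above are different encodings of the same geometry. As a rough illustration, the sketch below converts a 2D point between WKT and GeoJSON using only the standard library. The helper names are hypothetical and only `POINT` geometries are handled; a real pipeline would rely on the readers of Sedona or another geospatial library instead.

```python
import re

# Matches a 2D WKT point such as "POINT (2.35 48.85)".
_POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def wkt_point_to_geojson(wkt):
    """Parse a 2D WKT point and return the equivalent GeoJSON geometry."""
    match = _POINT_RE.match(wkt)
    if match is None:
        raise ValueError(f"not a 2D WKT point: {wkt!r}")
    x, y = (float(v) for v in match.groups())
    return {"type": "Point", "coordinates": [x, y]}


def geojson_point_to_wkt(geometry):
    """Serialize a GeoJSON Point geometry back to WKT."""
    if geometry.get("type") != "Point":
        raise ValueError("only Point geometries are handled in this sketch")
    x, y = geometry["coordinates"][:2]
    return f"POINT ({x} {y})"


if __name__ == "__main__":
    geometry = wkt_point_to_geojson("POINT (2.35 48.85)")
    print(geometry)
    print(geojson_point_to_wkt(geometry))
```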
Sedona supports multiple CRSs (Coordinate Reference Systems) and data conversion between different spatial reference systems.
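
As an illustration of what a CRS conversion computes, the sketch below implements the spherical Web Mercator formulas that map WGS 84 longitude/latitude (EPSG:4326) to EPSG:3857 metres, and back. The function names are hypothetical and this is only one specific pair of reference systems; in Sedona, conversions of this kind are exposed through the `ST_Transform` SQL function rather than hand-written math.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used as sphere radius by EPSG:3857


def wgs84_to_web_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to Web Mercator x/y in metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def web_mercator_to_wgs84(x, y):
    """Inverse projection: Web Mercator metres back to degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon, lat


if __name__ == "__main__":
    x, y = wgs84_to_web_mercator(2.35, 48.85)
    print(x, y)
    print(web_mercator_to_wgs84(x, y))
```

The projection is undefined at the poles, so latitudes should stay strictly between -90 and 90 degrees.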
Sedona provides a rich data visualization API (e.g. pydeck, kepler).
Sedona provides GeoPandas compatibility.
Sedona supports APIs in several languages:
- Python
- R
- Java
- Scala
- Pros:
- Easy to deploy and maintain
- SQL-based geospatial data transformation and processing.
- Easy to integrate with traditional GIS tools (e.g. QGIS, GeoServer, ArcGIS).
- Row oriented: better support for transactional data management.
- Cons:
- Limited scalability for large datasets (single-node performance constraints).
- Storage and computation are strongly coupled (possible data duplication).
- Pros:
- Can handle large-scale geospatial data processing (designed for distributed computing architecture).
- Complete dissociation of data storage and processing. It supports various file systems (e.g. HDFS, S3, etc.).
- Easy to integrate with the big data ecosystem (e.g. Hadoop, Flink, etc.).
- Column oriented: better support for column oriented data transformation and analysis.
- Cons:
- Hard to deploy and maintain in cluster mode (single-node mode is easy to deploy and maintain).
- Not designed for transactional data management.
- Complete dissociation of data storage and calculation (zero data duplication).
There are many metadata management tools. Most of them require a license; some are open source:
- Atlas
- Amundsen
- DataHub
- OpenMetadata
- Marquez
In this seminar, we only compare Atlas and OpenMetadata.
If you are not familiar with terminology such as data governance, data management and metadata management, please go to this page.
Apache Atlas is a metadata management and data governance tool that helps track and manage changes to dataset metadata. It provides a solution for collecting, processing, storing and maintaining metadata about data objects. It also offers a rich REST interface for a multitude of operations, such as creating object types, inserting entities and searching.
For more details, you can visit their official website and their GitHub page (1.9k stars).
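
To give an idea of what inserting an entity through the Atlas REST interface looks like, the sketch below builds (but does not send) a request against the v2 entity endpoint. Several things are assumptions: the host `localhost:21000`, the custom type `geo_dataset`, its attributes and the helper name `build_create_entity_request` are all hypothetical, and authentication is omitted. Check the Atlas REST documentation for your version before relying on the payload shape.

```python
import json
from urllib.request import Request

# Hypothetical Atlas server location; adjust to your deployment.
ATLAS_URL = "http://localhost:21000"


def build_create_entity_request(type_name, attributes, base_url=ATLAS_URL):
    """Build a POST request that would create one entity of the given type."""
    payload = {"entity": {"typeName": type_name, "attributes": attributes}}
    return Request(
        url=f"{base_url}/api/atlas/v2/entity",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # "geo_dataset" is a hypothetical custom type that must already exist in Atlas.
    request = build_create_entity_request(
        "geo_dataset",
        {"qualifiedName": "parcels@casd", "name": "parcels"},
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # Sending is left out on purpose: urllib.request.urlopen(request) would
    # need a reachable server and credentials.
```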
OpenMetadata is a unified platform for data
- discovery
- observability
- governance
It provides a central metadata repository, in-depth lineage, and seamless team collaboration.
For more details, you can visit their official website and their GitHub page (5.8k stars).
- Pros:
- Mature metadata management solution (e.g. metadata entity CRUD, data lineage, tags/glossary).
- Supports the creation of custom metadata types, which can have primitive attributes, complex attributes and object references.
- Cons:
- Legacy UI and less intuitive interface.
- Limited support for automatic metadata ingestion: only seven hooks exist (e.g. HBase, Hive, Storm, etc.).
- Requires significant maintenance outside Hadoop environments.
- Access control policies are hard to set up.
- Pros:
- Mature metadata management solution (e.g. metadata entity CRUD, data lineage, tags/glossary).
- Supports the creation of custom metadata types, which can have primitive attributes, complex attributes and object references.
- Access control policies are easy to set up.
- Broad support for automatic metadata ingestion: it provides more than 50 hooks (e.g. MySQL, Postgres, Hive, Spark, etc.).
- Modern UI that allows collaboration between different user roles.
- Cons:
- Requires significant maintenance if hosted on premises.
NiFi is a dataflow automation tool designed for real-time, event-driven data ingestion, routing, transformation, and monitoring.
For more details, you can visit their official website and their GitHub page (5k stars).
Airflow is a workflow orchestration tool designed to programmatically schedule, process, and monitor complex batch ETL workflows.
For more details, you can visit their official website and their GitHub page (38.2k stars).
- Pros:
- Best for real-time, event-driven dataflows.
- Easy to use with a visual, no-code UI.
- Built-in data lineage and provenance tracking.
- Provides a large number of predefined processors (https://nifi.apache.org/components/), so simple data integration needs little or no code.
- Cons:
- Not ideal for complex batch workflows: custom processors must be written to hold the data transformation logic.
- Hard to share data pipelines via templates (https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Create_Template).
- Complex configuration for cluster mode (e.g. ZooKeeper), and high-volume dataflows require more resources.
Use NiFi if:
- You need event-driven data ingestion and processing.
- Data movement and integration between systems are the main goals.
- Pros:
- Best for batch data processing and workflow scheduling.
- Highly customizable through Python-based workflows (Directed Acyclic Graphs).
- Provides various operators for complex data transformation (e.g. Bash, Python, MySQL, etc.).
- Supports various executors (e.g. Celery, K8s).
- Cons:
- Not designed for real-time data ingestion.
- Requires Python coding skills to define workflows.
Use Apache Airflow if:
- You need complex data transformation with various dependencies.
- You need to handle large-scale data sources.
- Workflows are based on scheduled jobs or batch processing.
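
The idea of a workflow expressed as a Directed Acyclic Graph can be sketched with the Python standard library alone (`graphlib`, Python 3.9+). This is not Airflow's API: the task names and the `execution_order` helper are hypothetical, and the snippet only shows how task dependencies determine a valid run order, which is the part an orchestrator then schedules, retries and monitors.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_parcels": set(),
    "extract_addresses": set(),
    "join_datasets": {"extract_parcels", "extract_addresses"},
    "export_geoparquet": {"join_datasets"},
}


def execution_order(dag):
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    print(execution_order(PIPELINE))

    # A dependency cycle is not a valid DAG and is rejected.
    try:
        execution_order({"a": {"b"}, "b": {"a"}})
    except CycleError as error:
        print("invalid workflow:", error.args[0])
```

The two extract tasks have no dependency on each other, so their relative order is not fixed; an orchestrator is free to run such tasks in parallel.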