How Global File System Can Help You Manage ...

30 Dec., 2024

 


Global File System Enables Customers to Access Unstructured Data on Demand


Unstructured data can refer to any data content, including text files, pictures, videos, and audio. Some unstructured data has an internal structure, but that structure is flexible and not well suited to storage in a traditional relational database. Over the past decades, different protocols have been developed to access unstructured data from different applications, including S3, HDFS, NFS, SMB, and FTP. Because there are so many types of unstructured data and access protocols, customers usually need to deploy different storage products with specific protocols to meet the requirements of different service scenarios. For example, a customer may deploy S3 object storage to support services such as a receipt image service, HDFS big data storage to support analysis services such as Hadoop and Spark, and NFS/SMB storage to support PACS services.

With the development of technologies such as 5G and IoT, unstructured data is growing explosively, driving more and more customers to deploy multiple sets of storage devices to support different services. At the same time, with the wide use of public and private cloud IT infrastructure, and with the differences in energy costs caused by imbalanced regional development, new application modes have emerged, such as data ingested in one region but analyzed in another, or data ingested in an on-premises cluster but analyzed in the public cloud. Unified access to data across different storage systems and regions has therefore become a core IT capability requirement for most enterprises. OceanStor Pacific's Global File System (GFS) is an advanced feature designed to meet these requirements: it provides unified sharing capabilities for heterogeneous data and cross-region data.

GFS supports heterogeneous multi-source data sharing, accelerating data access

In the production environment of most customers, different storage products are deployed to support different application scenarios, as shown in Figure 1.

Figure 1: GFS multi-source data sharing, accelerating access


To meet application performance requirements, users may adopt one of several acceleration methods:

Method 1: Data in the shared storage is prefetched to a local SSD on the computing side ahead of time. In this method, users must streamline the entire application workflow, prefetch data at the proper steps, and evict the local SSD cache on the computing side afterwards. Because every application has a different workflow, this logic cannot be reused efficiently across applications. (A minimal sketch of this approach appears after Method 3.)

Method 2: Deploy an open-source acceleration layer such as Alluxio to prefetch and evict data automatically based on policies, thereby improving data access speed. The disadvantage is that such open-source layers do not support the POSIX/MPI-IO interfaces required by HPC applications, and therefore cannot meet those applications' requirements.

Method 3: Use the acceleration solution provided by a commercial storage product. The data acceleration capabilities of different storage products each have their own advantages and disadvantages.
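For Method 1, here is a minimal sketch of the application-managed approach, assuming a POSIX-mounted shared storage path and a local SSD cache directory. All paths, sizes, and helper names are hypothetical, not part of any product:

```python
import os
import shutil

SHARED_MOUNT = "/mnt/shared"        # hypothetical NFS/SMB mount of shared storage
LOCAL_CACHE = "/ssd/cache"          # hypothetical local SSD cache directory
CACHE_LIMIT_BYTES = 100 * 1024**3   # keep at most 100 GiB cached locally

def prefetch(relpath: str) -> str:
    """Copy one file from shared storage to the local SSD and return the local path."""
    src = os.path.join(SHARED_MOUNT, relpath)
    dst = os.path.join(LOCAL_CACHE, relpath)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    if not os.path.exists(dst):
        shutil.copy2(src, dst)
    os.utime(dst)  # refresh access time so eviction treats it as recently used
    return dst

def evict_lru():
    """Delete least-recently-used cached files until the cache fits its size limit."""
    files = []
    for root, _, names in os.walk(LOCAL_CACHE):
        for name in names:
            path = os.path.join(root, name)
            st = os.stat(path)
            files.append((st.st_atime, st.st_size, path))
    total = sum(size for _, size, _ in files)
    for _, size, path in sorted(files):  # oldest access time first
        if total <= CACHE_LIMIT_BYTES:
            break
        os.remove(path)
        total -= size
```

The pain point described above is visible here: every application has to decide for itself when to call prefetch and evict, so the logic must be rebuilt for each workflow.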

The GFS feature of OceanStor Pacific inherits OceanStor Pacific's capabilities and exposes the standard protocols (NFS/SMB/HDFS/S3/POSIX/MPI-IO) to applications. It supports policy-based prefetching from external unstructured storage, including NFS, SMB, and HDFS storage, public cloud S3 object storage, and third-party vendors' S3 object storage; it also evicts cached data according to data access hotspots and supports quota settings for multi-source heterogeneous data. In addition to accelerating data access for independently deployed OceanStor Pacific storage systems (NFS, SMB, HDFS, or S3), standard NFS, SMB, HDFS, or S3 storage systems from third-party vendors are also supported.
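To make the policy model concrete, here is a hypothetical prefetch/eviction/quota policy expressed as a Python dict. None of these keys or values are OceanStor Pacific's actual configuration interface; they only illustrate the kinds of knobs such a policy involves:

```python
# Hypothetical GFS-style acceleration policy; all key names are illustrative only.
gfs_policy = {
    "source": {
        "protocol": "s3",                      # external source: NFS/SMB/HDFS/S3
        "endpoint": "https://s3.example.com",  # placeholder endpoint
        "bucket": "ingest-data",
    },
    "prefetch": {
        "mode": "policy",          # prefetch driven by rules, not application code
        "include": ["*.parquet"],  # prefetch only the analytics files
        "schedule": "0 2 * * *",   # nightly, ahead of the analysis window
    },
    "eviction": {
        "trigger": "capacity",     # evict when the cache tier fills up
        "watermark_percent": 85,
        "order": "coldest-first",  # hotspot-aware: cold data leaves first
    },
    "quota": {"max_cache_bytes": 50 * 1024**4},  # 50 TiB cap for this source
}
```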

In terms of data access mode, GFS supports two modes: read/write-through and read/write cache. The read/write-through mode is applicable to scenarios where data must be quickly synchronized to the target storage device. The read/write cache mode is applicable to scenarios where both read and write access needs to be accelerated; when changed data is flushed to the target storage device depends on the GFS cache eviction policy and period.
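The difference between the two modes can be sketched with a toy caching layer. This is an illustrative model only, not the OceanStor Pacific implementation; the class and method names are invented:

```python
class CachingTier:
    """Toy model of a caching layer in front of a slower target storage."""

    def __init__(self, target: dict, write_through: bool):
        self.target = target        # stands in for the remote/target storage
        self.cache = {}             # local fast tier (e.g., an SSD cache)
        self.dirty = set()          # keys modified but not yet flushed
        self.write_through = write_through

    def write(self, key: str, value: bytes):
        self.cache[key] = value
        if self.write_through:
            # Write-through: the target is updated synchronously, so it is
            # always in sync and other sites see the new data quickly.
            self.target[key] = value
        else:
            # Write cache (write-back): only the cache is updated now; the
            # flush happens later, driven by the eviction policy and period.
            self.dirty.add(key)

    def read(self, key: str) -> bytes:
        if key not in self.cache:           # read miss: fetch from the target
            self.cache[key] = self.target[key]
        return self.cache[key]

    def flush(self):
        """Periodic flush of dirty data, as a cache policy might trigger."""
        for key in list(self.dirty):
            self.target[key] = self.cache[key]
            self.dirty.discard(key)
```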

GFS supports cross-region data sharing, accelerating access

To meet the requirements of cross-region information sharing and flow, OceanStor Pacific GFS supports heterogeneous data source acceleration and on-demand data flow and sharing, as shown in Figure 2.

Figure 2: GFS cross-domain data sharing and flow


To implement cross-domain data flow, GFS uses different policies for different target storage systems. If the target storage is OceanStor Pacific, data is deduplicated and compressed before it flows, then decompressed and restored on the target GFS, reducing the amount of data transferred across domains and improving flow efficiency. If the target storage is not OceanStor Pacific, deduplication and compression cannot be performed before the data transfer. The whole process is shown in Figure 3.

Figure 3: Data flow between GFS domains
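Below is a minimal sketch of the compress-before-transfer idea, using zlib as a stand-in for the storage system's internal deduplication and compression; the function names are hypothetical:

```python
import zlib

def send_cross_domain(data: bytes, target_is_oceanstor: bool) -> bytes:
    """Prepare a payload for cross-domain transfer.

    When both ends are OceanStor Pacific, the data can travel in reduced
    form and be restored on the target; otherwise it must travel as-is.
    """
    if target_is_oceanstor:
        return zlib.compress(data, 6)  # stand-in for dedupe + compress
    return data                        # third-party target: raw transfer

def receive_cross_domain(payload: bytes, source_is_oceanstor: bool) -> bytes:
    """Restore the payload on the target side."""
    if source_is_oceanstor:
        return zlib.decompress(payload)
    return payload
```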


All metadata changes from DC1, DC2, public cloud S3 storage, or an external third-party vendor's S3 storage are detected, merged, and transferred to the other storage systems in the GFS relationship. To improve metadata exchange efficiency, OceanStor Pacific GFS uses two different methods:

The first method is for heterogeneous storage systems, e.g., S3 in the public cloud or a third-party vendor's S3 object storage. OceanStor Pacific GFS scans the folder or bucket at an interval configured by the customer, compares the newly scanned metadata with the previously scanned metadata to identify changes, fetches the new metadata for those changes, compacts the changes into one metadata delta image, and transfers it to the remote OceanStor Pacific system. If the remote storage is not OceanStor Pacific, standard S3 is used to update the new metadata on the remote storage one object at a time.

The second method is for OceanStor Pacific systems: the system records a log entry for each metadata change, compacts these change logs into one metadata image, and transfers it to the remote OceanStor Pacific storage system.

If the configured cross-region shared folder is large, with millions of files or objects, the second method is more efficient than the first. A minimal sketch of both methods follows.
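The sketch below assumes metadata is modeled as a mapping from object key to a version tag such as an S3 ETag; all names are illustrative, not the product API:

```python
# Method 1: periodic scan-and-compare, for heterogeneous (e.g., S3) sources.
def scan_diff(previous: dict, current: dict) -> dict:
    """Return a delta image: keys mapped to their new tag, or None if deleted."""
    delta = {}
    for key, etag in current.items():
        if previous.get(key) != etag:          # new or modified object
            delta[key] = etag
    for key in previous.keys() - current.keys():
        delta[key] = None                      # deleted object
    return delta

# Method 2: compact a per-change log, for OceanStor Pacific sources.
def compact_changelog(log: list) -> dict:
    """Collapse a change log into one metadata image.

    Later entries for the same key supersede earlier ones, so a long log
    over millions of files compacts into one small image, which is why
    this method scales better than rescanning the whole namespace.
    """
    image = {}
    for key, etag in log:                      # log entries are in change order
        image[key] = etag
    return image

# Example: two updates and a delete to the same key compact to one entry.
log = [("a.txt", "v1"), ("a.txt", "v2"), ("b.txt", "v1"), ("a.txt", None)]
assert compact_changelog(log) == {"a.txt": None, "b.txt": "v1"}
```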

The OceanStor Pacific storage system is designed to store and access massive amounts of data efficiently. Its seamless multi-protocol interworking provides a single-system global namespace, reducing redundant copies of data across different protocols. GFS extends this design, giving customers efficient data access across clusters, regions, and heterogeneous data sources.

Disclaimer: The views and opinions expressed in this article are those of the author and do not necessarily reflect the official policy, position, products, and technologies of Huawei Technologies Co., Ltd. If you need to learn more about the products and technologies of Huawei Technologies Co., Ltd., please visit our website at e.huawei.com or contact us.

Google File System (GFS) vs. Hadoop Distributed ...

In distributed file systems, Google File System (GFS) and Hadoop Distributed File System (HDFS) stand out as crucial technologies. Both are designed to handle large-scale data, but they cater to different needs and environments. In this article, we will examine the differences between them.

Google File System (GFS) vs. Hadoop Distributed File System (HDFS)

What is Google File System (GFS)?

Google File System (GFS) is a distributed file system designed by Google to handle large-scale data storage across multiple machines while providing high reliability and performance.

  • It was developed to meet the needs of Google's massive data processing and storage requirements, particularly for its search engine and other large-scale applications.

  • GFS is optimized for storing and processing very large files (in the range of gigabytes or terabytes) and supports high-throughput data operations rather than low-latency access.

Key Features of Google File System (GFS)

Below are the key features of Google File System (GFS):

  • Scalability: GFS can scale to thousands of storage nodes and manage petabytes of data.

  • Fault Tolerance: Data is replicated across multiple machines, ensuring reliability even in case of hardware failures.

  • High Throughput: It's optimized for large data sets and supports concurrent read and write operations.

  • Chunk-based Storage: Files are divided into fixed-size chunks (usually 64 MB) and distributed across many machines (see the sketch after this list).

  • Master and Chunkserver Architecture: GFS employs a master server that manages metadata and multiple chunkservers that store the actual data.
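Here is a small sketch of the arithmetic implied by chunk-based storage, assuming the commonly cited 64 MB chunk size; the helper functions are illustrative, not GFS's actual code:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic GFS chunk size

def chunk_index(offset: int) -> int:
    """Map a byte offset within a file to the index of the chunk holding it."""
    return offset // CHUNK_SIZE

def chunk_count(file_size: int) -> int:
    """Number of chunks needed to hold a file (ceiling division)."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE

# A 1 TiB file occupies 16384 chunks; byte 200,000,000 falls in chunk 2.
assert chunk_count(1024**4) == 16384
assert chunk_index(200_000_000) == 2
```

Because the master only tracks chunk-level metadata rather than every byte range, large chunks keep the metadata small enough to serve from a single master's memory.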

What is Hadoop Distributed File System (HDFS)?

Hadoop Distributed File System (HDFS) is an open-source distributed file system inspired by GFS, designed to store large amounts of data across a cluster of machines while ensuring fault tolerance and scalability. It is a core component of the Apache Hadoop ecosystem and is built to handle large-scale data processing jobs such as those found in big data environments.

Key Features of Hadoop Distributed File System (HDFS)

Below are the key features of Hadoop Distributed File System:

  • Distributed Architecture: HDFS stores files across a distributed cluster of machines.

  • Fault Tolerance: Data is replicated across multiple nodes, ensuring that the system can recover from failures.

  • Master-Slave Architecture: HDFS consists of a single master node (NameNode) that manages metadata and multiple slave nodes (DataNodes) that store the actual data.

  • Large Block Size: HDFS breaks files into large blocks (128 MB by default in Hadoop 2.x and later; 64 MB in Hadoop 1.x) to optimize read/write operations for large datasets (see the sketch after this list).

  • Write Once, Read Many: HDFS is optimized for workloads that involve writing files once and reading them multiple times.
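Here is a small sketch of what the default block size and replication factor mean for storage footprint; the helper is illustrative arithmetic, not HDFS code:

```python
BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size (Hadoop 2.x and later)
REPLICATION = 3                  # HDFS default replication factor

def hdfs_footprint(file_size: int) -> tuple:
    """Return (number of blocks, total raw bytes stored across DataNodes)."""
    blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
    # Every block is replicated; the last block may be smaller than BLOCK_SIZE,
    # and HDFS does not pad it, so raw usage is simply file_size * REPLICATION.
    return blocks, file_size * REPLICATION

# A 1 GiB file splits into 8 blocks and consumes 3 GiB of raw capacity.
blocks, raw = hdfs_footprint(1024**3)
assert blocks == 8 and raw == 3 * 1024**3
```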

Google File System (GFS) vs. Hadoop Distributed File System (HDFS)

Below are the key differences between Google File System and Hadoop Distributed File System:

| Aspect | Google File System (GFS) | Hadoop Distributed File System (HDFS) |
| --- | --- | --- |
| Origin | Developed by Google for their internal applications. | Developed by Apache for open-source big data frameworks. |
| Architecture | Master-slave architecture with a single master (GFS master) and chunkservers. | Master-slave architecture with a NameNode and DataNodes. |
| Block/Chunk Size | Default chunk size of 64 MB. | Default block size of 128 MB (configurable). |
| Replication Factor | Default replication is 3 copies. | Default replication is 3 copies (configurable). |
| File Access Pattern | Optimized for write-once, read-many access patterns. | Also optimized for write-once, read-many workloads. |
| Fault Tolerance | Achieves fault tolerance via data replication across multiple chunkservers. | Achieves fault tolerance via data replication across multiple DataNodes. |
| Data Integrity | Uses checksums to ensure data integrity. | Uses checksums to ensure data integrity. |
| Data Locality | Focuses on keeping computation close to the data for efficiency. | Provides data locality by moving computation to where the data is stored. |
| Cost Efficiency | Designed to run on commodity hardware. | Also designed to run on commodity hardware. |

Use Cases of Google File System (GFS)

Below are the use cases of Google File System (GFS):

  • Web Indexing and Search Engine Operations:

    • GFS was originally developed to support Google's search engine.

    • It handles massive amounts of web data (such as crawled web pages) that need to be processed, indexed, and stored efficiently.

    • The system enables fast access to large datasets, making it ideal for web crawling and indexing tasks.

  • Large-Scale Data Processing:

    • GFS is used in large-scale distributed data processing jobs where files can be extremely large (gigabytes or terabytes).

    • It supports high-throughput data access, making it suitable for data processing jobs like MapReduce.

    • Google used GFS for data-intensive tasks like search indexing, log analysis, and content processing.

  • Machine Learning and AI Workloads:

    • GFS is also employed in machine learning tasks at Google.

    • Since machine learning often involves processing large datasets for training models, GFS's ability to handle large files and provide high-throughput data access makes it useful for machine learning pipelines.

  • Distributed Video and Image Storage:

    • GFS is used to store and process large multimedia files, such as videos and images, for Google services like YouTube and Google Images.

    • Its fault tolerance and ability to scale out to handle massive amounts of media make it ideal for these types of workloads.

  • Log File Storage and Processing:

    • Large-scale applications generate enormous log files, which can be stored in GFS for future analysis.

    • Google uses GFS to store and analyze logs for various services (e.g., Google Ads, Gmail) to identify trends, detect anomalies, and improve service quality.

Use Cases of Hadoop Distributed File System (HDFS)

Below are the use cases of Hadoop Distributed File System (HDFS):

  • Big Data Analytics:

    • HDFS is widely used for big data analytics in environments that require the storage and processing of massive datasets.

    • Organizations use HDFS for tasks such as customer behavior analysis, predictive modeling, and large-scale business intelligence analysis using tools like Apache Hadoop and Apache Spark.

  • Data Warehousing:

    • HDFS serves as the backbone for data lakes and distributed data warehouses.

    • Enterprises use it to store structured, semi-structured, and unstructured data, enabling them to run complex queries, generate reports, and derive insights using data warehouse tools like Hive and Impala.

  • Batch Processing via MapReduce:

    • HDFS is the foundational storage layer for running batch processing jobs using the MapReduce framework.

    • Applications like log analysis, recommendation engines, and ETL (extract-transform-load) workflows commonly run on HDFS with MapReduce.

  • Machine Learning and Data Mining:

    • HDFS is also popular in machine learning environments for storing large datasets that need to be processed by distributed algorithms.

    • Frameworks like Apache Mahout and MLlib (Spark's machine learning library) work seamlessly with HDFS for training and testing machine learning models.

  • Social Media Data Processing:

    • HDFS is commonly used in social media analytics to process large-scale user-generated content such as tweets, posts, and multimedia.

    • Social media companies use HDFS to store, analyze, and extract trends or insights from vast amounts of data.

Conclusion

In conclusion, GFS is used only by Google for its own workloads, while HDFS is open source and widely used by many companies. GFS handles Google's big data, and HDFS helps other businesses store and process large amounts of data through tools like Hadoop.


