[Hadoop] How to archive HDFS directories/files into a single har file

눈가락 2022. 4. 1. 15:01

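When HDFS holds so many directories/files that the NameNode's memory starts to look at risk,

you can archive directories/files in HDFS into a single har file

and bring the number of directories/files down.

Fewer directories/files means less file metadata,

which saves NameNode memory.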
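
To get a feel for the savings, here is a back-of-the-envelope sketch. It assumes the often-quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file, directory, or block). That figure is not from this post, and real usage varies by Hadoop version and configuration. `estimate_nn_bytes` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Rough NameNode heap estimate: number of objects * bytes per object.
# ASSUMPTION: ~150 bytes per namespace object is a rule of thumb, not a measured value.
estimate_nn_bytes() {
  local objects="$1"
  local bytes_per_object="${2:-150}"
  echo $(( objects * bytes_per_object ))
}

# 1 directory + 1000 small files + 1000 blocks (one block per file) = 2001 objects
estimate_nn_bytes 2001   # prints 300150
```

Rerun it with the object count you expect after archiving to compare the two numbers.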

For reference, to find out how many files and directories sit under a path on HDFS, use the -count option.

hdfs dfs -count -h /user/eyeballs/*

The output has four columns, which mean the following.

DIR_COUNT

FILE_COUNT

CONTENT_SIZE

PATHNAME

So the first and second columns tell you which paths hold a lot of directories and files.
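
When there are many paths, sorting by the second column makes the heavy ones stand out. A minimal sketch, assuming the counts are plain integers (so -count is run without -h, which may abbreviate values and break numeric sorting). `top_by_file_count` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Keep the N rows with the largest FILE_COUNT (column 2) from `hdfs dfs -count` output on stdin.
# ASSUMPTION: counts are plain integers, so run -count without -h before piping in.
top_by_file_count() {
  local limit="${1:-10}"
  sort -k2,2nr | head -n "$limit"
}

# Usage on a real cluster:
#   hdfs dfs -count /user/eyeballs/* | top_by_file_count 5
```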

Let me explain how to archive into a har file with an example.

My HDFS has the following 1000 files.

hadoop fs -ls /user/eyeballs/data

/user/eyeballs/data/1
/user/eyeballs/data/2
...
/user/eyeballs/data/1000

I want to bundle these 1000 files into a single har file (archiving).

Using the hadoop archive command,

I pack the /user/eyeballs/data directory itself

into a file named test.har.

hadoop archive -archiveName test.har -p /user/eyeballs data /user/eyeballs/

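Let's take the command apart piece by piece.

hadoop archive : the command that runs Hadoop's archiving feature

It runs as a map-reduce job

-archiveName : the name of the har file that will hold the archived directory (the name should end in .har)

test.har : archive under the name 'test.har'

-p : the parent path of the directory to archive

In this example it is /user/eyeballs,

because the directory I am archiving is /user/eyeballs/data

and the parent path of data is /user/eyeballs/

data : the directory to archive, given relative to the parent path

/user/eyeballs : the path where the resulting har file is stored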
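
The breakdown above can be sketched as a small helper that assembles the command from its parts and only prints it, so you can check the argument order before running anything. `build_archive_cmd` is a hypothetical name used for illustration.

```shell
#!/usr/bin/env bash
# Print (do not run) a hadoop archive command built from its parts.
# Arguments: <archive name> <parent path> <destination path> <source dir>...
build_archive_cmd() {
  local name="$1" parent="$2" dest="$3"
  shift 3
  # Remaining arguments are the source directories, relative to the parent path.
  echo "hadoop archive -archiveName ${name} -p ${parent} $* ${dest}"
}

build_archive_cmd test.har /user/eyeballs /user/eyeballs/ data
# prints: hadoop archive -archiveName test.har -p /user/eyeballs data /user/eyeballs/
```

Passing more source directories after the destination argument of the helper places them all between the parent path and the destination in the printed command.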

After running the command above, check with ls

and you can see that test.har was created.

hadoop fs -ls /user/eyeballs/

/user/eyeballs/data
/user/eyeballs/test.har

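The resulting test.har is classified as a dir, not a file.
The original data directory is still there as well: hadoop archive does not remove its input, so the file count only goes down once you delete the original yourself after checking the archive.
Looking inside test.har shows the following.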

hdfs dfs -ls /user/eyeballs/test.har/

/user/eyeballs/test.har/_SUCCESS
/user/eyeballs/test.har/_index
/user/eyeballs/test.har/_masterindex
/user/eyeballs/test.har/part-0

Want to see the archived data directory inside test.har?

With the 'har' scheme, you can see

the data directory I archived inside test.har.

hadoop fs -ls har:///user/eyeballs/test.har

har:///user/eyeballs/test.har/data

You can open the data directory inside test.har as well.

hadoop fs -ls har:///user/eyeballs/test.har/data

har:///user/eyeballs/test.har/data/1
har:///user/eyeballs/test.har/data/2
...
har:///user/eyeballs/test.har/data/1000
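
The listings above all follow the same pattern: prefix the archive's HDFS path with har:// and append the path inside the archive. A minimal sketch of that mapping, where `har_uri` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Build a har URI from an absolute archive path and an optional path inside the archive.
har_uri() {
  local archive="$1" inner="${2:-}"
  # An absolute archive path yields har:///..., which resolves against the default file system.
  if [ -n "$inner" ]; then
    echo "har://${archive}/${inner}"
  else
    echo "har://${archive}"
  fi
}

har_uri /user/eyeballs/test.har data/1
# prints: har:///user/eyeballs/test.har/data/1

# Reading an archived file on a real cluster:
#   hadoop fs -cat "$(har_uri /user/eyeballs/test.har data/1)"
```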

The test.har you created

can be deleted the same way you delete any directory in HDFS.

hadoop fs -rm -r /user/eyeballs/test.har

The example above archived only /user/eyeballs/data.

What if you want to archive several directories?

Say /user/eyeballs contains the directories data, data2, and data3.

hadoop fs -ls /user/eyeballs/

/user/eyeballs/data
/user/eyeballs/data2
/user/eyeballs/data3

To archive data, data2, and data3 together,

list all of their names in the position where the directory to archive goes.

hadoop archive -archiveName test.har -p /user/eyeballs data data2 data3 /user/eyeballs/

As we saw above,

the har scheme lets you see data, data2, and data3 inside the test.har archive.

hadoop fs -ls har:///user/eyeballs/test.har

har:///user/eyeballs/test.har/data
har:///user/eyeballs/test.har/data2
har:///user/eyeballs/test.har/data3

Restoring a har file such as test.har back into HDFS

is done with the hadoop distcp command.

Say you have test.har as follows.

hadoop fs -ls /user/eyeballs/

/user/eyeballs/test.har

To restore the test.har above, use the command below.

hadoop distcp har:///user/eyeballs/test.har/ /user/eyeballs/decompress/

All directories inside test.har are restored

under the last argument (/user/eyeballs/decompress/).

hadoop fs -ls /user/eyeballs/decompress

/user/eyeballs/decompress/data
/user/eyeballs/decompress/data2
/user/eyeballs/decompress/data3
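
The official Hadoop Archives guide describes two ways to unarchive: distcp (parallel, runs as a map-reduce job) and a plain fs -cp (sequential). A minimal sketch of both, assuming the destination does not exist yet; the function names and the `HADOOP_BIN` variable are hypothetical and only there so the binary can be swapped out for a dry run.

```shell
#!/usr/bin/env bash
# Restore the contents of a har archive into a normal HDFS directory.
# Arguments: <absolute archive path> <destination path>
# HADOOP_BIN is an illustrative override, e.g. HADOOP_BIN=echo for a dry run.
HADOOP_BIN="${HADOOP_BIN:-hadoop}"

# Parallel restore: runs as a map-reduce job.
restore_parallel() {
  "$HADOOP_BIN" distcp "har://$1" "$2"
}

# Sequential restore: a plain copy, no map-reduce job.
restore_sequential() {
  "$HADOOP_BIN" fs -cp "har://$1" "$2"
}

# Usage on a real cluster:
#   restore_parallel /user/eyeballs/test.har /user/eyeballs/decompress
```

For a small archive the sequential copy may be enough; how either command lays out the result when the destination already exists can differ, so check with ls afterwards.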

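Note that distcp only copies, so test.har itself is still there after the restore; delete it with hadoop fs -rm -r once you no longer need it.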

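Extra)

The queue for the archive map-reduce job can be specified as follows. Generic options such as -D go right after the command name, before the archive's own options. (mapred.job.queue.name is the older property name; on Hadoop 2 and later the current name is mapreduce.job.queuename.)

hadoop archive -Dmapred.job.queue.name=queue_name -archiveName test.har -p /user/eyeballs data data2 data3 /user/eyeballs/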

Archiving files/directories that are too large

fails with an OOM error.

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
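
One thing that may be worth trying, sketched here as an assumption rather than a confirmed fix: the exception is thrown in thread "main", which suggests the client JVM that launches the job ran out of heap, and the client heap can be raised through the HADOOP_CLIENT_OPTS environment variable. The 4g value is an arbitrary example.

```shell
#!/usr/bin/env bash
# ASSUMPTION: the OOM happens in the client JVM, so a larger client heap may help.
# 4g is an arbitrary illustrative value; pick one that fits the machine.
export HADOOP_CLIENT_OPTS="-Xmx4g ${HADOOP_CLIENT_OPTS:-}"

# Then rerun the archive command from the same shell, e.g.:
#   hadoop archive -archiveName test.har -p /user/eyeballs data /user/eyeballs/
```

If that is not enough, splitting the input into several smaller archives is another option.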

References

https://hadoop.apache.org/docs/current/hadoop-archives/HadoopArchives.html

https://wikidocs.net/27605

License: Attribution-NonCommercial-ShareAlike (CC BY-NC-SA)
