눈가락★ :: 눈가락★

[Hive] Notes for technical interview questions

눈가락 2025. 3. 1. 15:02

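Apache Hive is a data warehouse system built on Hadoop.
It lets you run SQL against large volumes of data stored in HDFS, S3, and similar stores.

Hive is not a storage system that holds data itself; it runs jobs over data that lives in another store (HDFS, S3).
(Those jobs just happen to be queries.)

Running a query involves the following components:
- Hive Query Engine : converts a query written in HiveQL into an execution plan and runs it
- Execution Engine : actually executes the query on behalf of Hive (MapReduce, Tez, Spark, and so on)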

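For Hive to read data in HDFS,
the data needs a regular path layout (via partitioning) and a known file format (ORC, Parquet, Text, etc.).
Metadata about the data in HDFS (databases, tables, columns, partitions, file formats, data locations, etc.)
is stored in structured form in the Hive Metastore.
Table definitions and partition information live in the metastore as well.
The metastore records which HDFS directory a table is stored in, which path each partition lives at, and so on.

If you create a table with create external table and set its location,
Hive can go to that location even when the user put the data into HDFS directly, without going through Hive,
and read the data based on the metadata (created when create external table ran) (schema-on-read).
In other words, as long as the table exists in the Hive Metastore, its metadata tells Hive where to find the actual data in HDFS,
so data that the user stored in HDFS directly can still be queried.

The "database" and "table" that Hive works with are logical concepts for organizing and managing data in HDFS.
(HDFS itself has no "databases" or "tables".)
An HDFS directory is what Hive sees as a "database",
and a directory under the database directory (with the files in it) is a "table".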
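
A minimal sketch of how these logical names could map onto directories. It assumes the default warehouse directory /user/hive/warehouse and the usual layout for managed tables (a `<database>.db` directory per non-default database, one `key=value` directory per partition level). The `sales` database, the `orders` table, and the `table_path` helper are made-up names for illustration; external tables can point anywhere.

```python
from posixpath import join

WAREHOUSE_DIR = "/user/hive/warehouse"  # Hive's default warehouse directory


def table_path(database, table, partitions=None, warehouse=WAREHOUSE_DIR):
    """Return the directory a managed table (or one of its partitions) would use."""
    parts = [warehouse]
    # Tables in the built-in "default" database sit directly under the warehouse;
    # other databases get a "<name>.db" directory.
    if database != "default":
        parts.append(f"{database}.db")
    parts.append(table)
    # Each partition column adds one "key=value" directory level.
    for key, value in (partitions or {}).items():
        parts.append(f"{key}={value}")
    return join(*parts)


print(table_path("sales", "orders", {"year": "2024", "month": "01"}))
```

Nothing here talks to HDFS; it only shows that "database", "table", and "partition" boil down to a path convention recorded in the metastore.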

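Hive is a batch processing system, so it is a poor fit for low-latency, interactive analysis.
For that, use an interactive query engine such as Presto.
Hive is also fundamentally an append-only model, so deleting or updating data is hard.
ACID transactions (row-level UPDATE/DELETE arrived in Hive 0.14 and were much improved in Hive 3.0) or table formats such as Hudi and Iceberg address this.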

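A query written in Hive is converted into a series of jobs (MR, Spark, etc.) and executed.

Hive structures data stored in HDFS into tables by laying a schema over it (schema on read).

Metadata such as table schemas and partitions is stored in a DB called the metastore (MySQL, Derby, etc.).
If the metastore is on the local machine, only one user can use it at a time (the reason not to use Derby).
Production environments use a shared remote metastore.

If the HADOOP_HOME environment variable is set, Hive uses it to reach and use HDFS.
As long as Hive's classpath reaches the Hadoop configuration files (core-site.xml, hdfs-site.xml),
it gets HDFS connection details such as the NameNode address and port from them.
With that connection information, Hive accesses HDFS directly through the Hadoop HDFS client library (Hadoop's Java API).
Add HADOOP_HOME to HIVE_HOME/conf/hive-env.sh so that Hive can find the Hadoop settings.

Running the 'hive' command (with the default embedded Derby configuration) creates a metastore database on the user's machine:
it creates a metastore_db directory in the directory where the hive command was run and stores the files it needs there.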

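Hive can read local data and store it locally.
When loading, it stores the data as-is, without modifying it.
When storing locally, Hive writes under the warehouse directory (default /user/hive/warehouse).

Without partitioning, Hive reads every file of a table when it queries that table.
That is why bucketing and partitioning are needed for efficient queries.

Hive property precedence (highest first)
1. The SET command entered at the hive> prompt
2. The -hiveconf option on the command line
3. *-site.xml
4. *-default.xml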
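
The precedence list above can be pictured as a layered lookup. This is only a sketch with Python's `ChainMap`, not how Hive implements it; the property values and the `metastore-host` name are made up, and `effective_config` is a hypothetical helper.

```python
from collections import ChainMap


def effective_config(set_commands, hiveconf_options, site_xml, default_xml):
    """Layer the four sources so that earlier ones override later ones."""
    # ChainMap searches its maps left to right and returns the first hit.
    return ChainMap(set_commands, hiveconf_options, site_xml, default_xml)


config = effective_config(
    set_commands={"hive.execution.engine": "tez"},
    hiveconf_options={"hive.execution.engine": "spark", "hive.exec.parallel": "true"},
    site_xml={"hive.exec.parallel": "false", "hive.metastore.uris": "thrift://metastore-host:9083"},
    default_xml={"hive.execution.engine": "mr"},
)

print(config["hive.execution.engine"])  # the SET value wins
print(config["hive.exec.parallel"])     # -hiveconf beats hive-site.xml
print(config["hive.metastore.uris"])    # only defined in hive-site.xml
```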

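Hive supports MR, Tez, and Spark as query execution engines.
A 'query' that a developer submits to Hive is converted into a form suited to the chosen 'execution engine', and the engine does the actual work:
- The HiveQL query is parsed and analyzed
- A logical execution plan is generated from the analysis
- Hive's optimizer applies rule-based and cost-based optimizations (join reordering, predicate pushdown, partition pruning, etc.)
- The optimized logical plan is finally converted into a physical plan that the selected execution engine (MapReduce, Tez, Spark, etc.) can understand and run.
With MR, for example, the plan becomes a job made up of several Map and Reduce stages; the job is submitted to the ResourceManager, and the Map and Reduce tasks run on NodeManagers.
With Spark, the plan is converted into a graph (DAG) of RDD (Resilient Distributed Dataset) operations; the application is submitted to the Master node, and the Worker nodes run the transformations and actions in a distributed fashion.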

The execution engine can be changed in hive-site.xml as follows:
        <property>
                <name>hive.execution.engine</name>
                <value>tez</value>
        </property>

Or set it with a SET command:
- SET hive.execution.engine=spark;

Why Hive is still used even though SparkSQL exists

- Because the system already uses Hive
- To use the Hive metastore as a common metadata service
The metastore that Hive manages holds metadata such as table schemas, partition information, and data locations.
Other parts of the Hadoop ecosystem (SparkSQL included) need this information, so they share the Hive metastore.

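Hive services

- cli : started with the hive command. Lets you use HiveQL straight from the terminal.
Accesses HDFS and the metastore directly. Single user only. It does not check access permissions, so prefer hiveserver2 where possible.

- hiveserver2 : started with the hiveserver2 command.
Runs the Hive Thrift service so that clients written in other languages can connect.
hiveserver2 exposes a Thrift RPC service,
through which programs written in other languages can connect to Hive using Thrift, JDBC, or ODBC connectors (default port 10000).
For example, another program such as Tableau can (as a client) use Hive through hiveserver2.
(hiveserver2 acts like a gatekeeper, or an API.)
With hiveserver2, multiple clients can connect to Hive and run queries at the same time.
hiveserver2 has to be running as a daemon at all times so that other programs can reach it.
Access-permission checks and the like are done in hiveserver2.

- beeline : started with the beeline command.
A command-line interface that connects to the hiveserver2 process over JDBC.
That is, beeline is a client that reaches Hive through hiveserver2, and the user then works through it (running queries and so on).

- hwi : the Hive Web Interface

- metastore : the service that centrally manages metadata such as Hive table schemas, locations, and partition information.
By default the metastore runs in the same process as the Hive service.
It can also be run as a standalone process, independent of HiveQL execution (the remote metastore setup).
Hive CLI, HiveServer2, SparkSQL, and other Hive-related components access the metastore to share and use the metadata.
It is a single point of failure (SPOF), so HA is essential.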

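The metastore is the core store for Hive metadata.
It is split into a service and a backing data store (DB).

- Metastore service (API) : provides the interface through which Hive clients (CLI, HiveServer2, SparkSQL, etc.) access metadata. Instead of connecting to the metastore database directly, a client requests the metadata it needs from the metastore service and receives the response.
It exposes an API over the Thrift protocol for clients to reach the metastore service.

- Metastore store (DB) : the persistent store that holds the actual metadata, such as Hive table schemas (column names, data types, etc.), data locations (HDFS paths), partition information, and properties.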

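< Embedded metastore >

- Metastore service : runs inside the same JVM as the Hive service (Hive CLI, HiveServer2, etc.)
- Metastore store (DB) : runs inside the same JVM as the Hive service (Hive CLI, HiveServer2, etc.)
A Derby database that stores its data locally is used.
Derby starts when the JVM process running Hive CLI or HiveServer2 starts,
and shuts down when that process exits (they share a lifecycle).
If the server running the Hive service goes down or restarts unexpectedly,
in-memory changes may not have been fully written to disk, which can lead to data loss.
(In other words, if the server dies mid-write, the data being written can be lost.)
Embedded Derby lets only one process open the database files at a time, so only one Hive session can use it (a second session that tries to open the same metastore gets an error...)
A second session (from another user, say) cannot be opened,
so multiple sessions (multiple users) are not supported.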

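< Local metastore >

- Metastore service : still runs in the same process (JVM) as the Hive service (Hive CLI, HiveServer2, etc.)
- Metastore store (DB) : a database running in a separate process, on the same or another machine. MySQL, PostgreSQL, etc.
Multiple sessions (multiple users) are supported.
MySQL, PostgreSQL, or similar serves as the remote database.
The database connection settings (JDBC URL, driver class, user name, password, etc.) must be set in hive-site.xml.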

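< Remote metastore >

- Metastore service : one or more metastore services run as server processes in their own JVMs, separate from the Hive service
- Metastore store (DB) : a database running in a separate process on a separate machine. MySQL, PostgreSQL, etc.
Hive clients (CLI, hiveserver2, Spark, etc.) talk to the metastore server over the Thrift protocol.
The URI of the remote metastore server (hive.metastore.uris) must be set in hive-site.xml.
Because several metastore servers can be configured, this setup allows load balancing, high availability, and scalability.
The database server can sit entirely behind a firewall, and Hive clients reach the metadata (DB) only through the metastore server,
which improves security. Clients do not need database credentials.
Say the DB is a three-node cluster and there is one metastore server.
The DB servers accept connections only from the metastore server, so any other client is blocked.
The metastore server in turn accepts connections only from Hive clients.
This keeps Hive clients from touching the DB directly and routes them through the metastore server as an intermediary (API), which improves security.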

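Differences between an RDB and Hive
- An RDB uses schema on write; Hive uses schema on read.
Schema on write supports indexes, so queries are fast,
while schema on read makes the initial load very fast, because the data does not have to be parsed and serialized into the database's internal on-disk format.
- Hive supports RDB-style transactions and indexes only partially.
By default Hive cannot update data.
Hive with transactions enabled can run update,
but it does not rewrite the rows in place; it stores the changes in separate small delta files.
- Hive's insert into : appends data to a Hive table or to a partition of a partitioned table
insert overwrite : deletes all data in the Hive table or partition, then inserts
- Reading data acquires a SHARED_READ lock;
while that lock is held, other users can read but cannot update.
- Updating data acquires an EXCLUSIVE lock;
while that lock is held, other users can neither read nor update.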

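Indexes supported by Hive (indexing was removed in Hive 3.0, so this applies to older versions)
- Compact index : indexes by HDFS block number
- Bitmap index : indexes with compressed bitsets that efficiently record the rows in which a given value appears
Index data is stored in a separate table.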

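The data behind a Hive table can live on local disk, S3, HDFS, and so on.

Managed table : Hive manages the data directly.
Even so, the data still lives in an external store such as HDFS.
"Managed directly" means that when the table is dropped, the data is actually deleted.
A load query moves the data into the warehouse directory (local, HDFS, etc.).
A drop table query actually deletes both the data and the metadata.

External table : Hive does not manage the data directly.
A drop query deletes only the metadata; the data is not deleted.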

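< Partitions >
Data is split across directories. PARTITIONED BY
It is stored in HDFS in a layout such as year=2024/month=01/.

< Buckets >
The value of the chosen column is hashed, and the remainder after dividing by the number of buckets decides which bucket file a row goes to. Data is split into files, not directories. CLUSTERED BY

Why use buckets
- They allow very efficient queries
Bucketing imposes extra structure on the table, and queries can exploit that structure
- They make sampling efficient
- Joins between bucketed tables can be processed as SMB (sort merge bucket) joins, which speeds them up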
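
A rough sketch of the "hash, then take the remainder" rule for buckets. It assumes an integer column hashes to its own value and a string column hashes like Java's `String.hashCode()` (ASCII input only); the hash Hive actually uses depends on the column type and the bucketing version, so treat the functions below as hypothetical helpers for illustration.

```python
def java_string_hash(text):
    """Java-style String.hashCode(), wrapped to a signed 32-bit integer."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Convert the unsigned 32-bit value back to a signed one.
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_for(hash_code, num_buckets):
    """Pick a bucket index in [0, num_buckets) from a 32-bit hash code."""
    # Clearing the sign bit keeps the remainder non-negative.
    return (hash_code & 0x7FFFFFFF) % num_buckets


# With 4 buckets, user_id 17 lands in bucket 1 (17 % 4).
print(bucket_for(17, 4))
print(bucket_for(java_string_hash("kim"), 4))
```

Rows with the same column value always get the same bucket index, which is what lets two tables bucketed the same way be joined bucket by bucket.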

row format : sets the row delimiters and how the fields of a row are stored

- Delimiters that can be specified
  FIELDS TERMINATED BY '\t'            -- separates columns
  COLLECTION ITEMS TERMINATED BY ','   -- separates list items
  MAP KEYS TERMINATED BY '='           -- separates a map key from its value
  LINES TERMINATED BY '\n'             -- separates rows
  ESCAPED BY '\\'                      -- escape character for delimiters that appear inside values
  NULL DEFINED AS 'null'               -- representation of null values (added in 0.13)
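
To make the delimiters concrete, here is a small sketch that splits one text line the way those settings describe. It assumes a hypothetical table with three columns (name STRING, tags ARRAY<STRING>, attrs MAP<STRING,STRING>) and ignores ESCAPED BY and NULL handling; `parse_row` is not a Hive API.

```python
def parse_row(line, field_sep="\t", item_sep=",", map_key_sep="="):
    """Split one text row into a string, a list, and a map."""
    # LINES TERMINATED BY '\n': drop the row terminator first.
    name, tags, attrs = line.rstrip("\n").split(field_sep)
    # COLLECTION ITEMS TERMINATED BY ',': separates list items and map entries.
    tag_list = tags.split(item_sep) if tags else []
    # MAP KEYS TERMINATED BY '=': separates each key from its value.
    attr_map = dict(pair.split(map_key_sep, 1) for pair in attrs.split(item_sep)) if attrs else {}
    return {"name": name, "tags": tag_list, "attrs": attr_map}


row = parse_row("kim\tadmin,dev\tteam=data,level=3\n")
print(row)
```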

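- How the fields of a row are stored : data is serialized through a SerDe when written and deserialized through it when read.
Available SerDes include the default SerDe, regex (RegExSerDe), JSON (JsonSerDe), and CSV (OpenCSVSerde).

It amounts to saying: "When I store data in this table I will serialize it with the specified SERDE, and when I read from this table I will deserialize it with the same SERDE."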

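stored as : specifies the file format in which data is stored
Formats include TEXTFILE, SEQUENCEFILE, ORC, and PARQUET.
For binary formats such as SequenceFile, ORC, and Parquet, the row layout is determined by the format itself, so row format does not need to be specified.

Reference: https://wikidocs.net/23469

Because Hive uses schema on read,
renaming a table, changing a table definition, adding new columns, and so on are easy.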

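The shuffle join is the slowest join.

In the Map phase each table is read and shuffled with the join column as the key.

The data moves to the reducers, and the reducers join the tables.

Bucketing enables the faster Bucket Map Join.

https://data-flair.training/blogs/bucket-map-join/

When every table in the join is bucketed on the join key,

each mapper that processes a bucket of the big table (Table b) loads the matching buckets of the small tables (Table a, Table c) into memory.

Every key needed for the join is then reachable inside a single mapper, so the join gets faster.

The small tables' buckets must be small enough to fit in memory.

It is a map-side join, in the same family as the broadcast join.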

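The Sort Merge Bucket (SMB) join can be used when the joined tables are bucketed and sorted on the join key.

It uses the bucketing and sort order of the key to join quickly.

The join proceeds as follows:

- The matching buckets of Table a and Table b are read

- Rows are needed in join-key order (already the case when the buckets are stored sorted; otherwise they are sorted first)

- The sorted streams are merged to produce the join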
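
The merge step can be sketched as follows. This is a generic sort-merge inner join over two lists that are already sorted by key (think of them as one matching pair of buckets), not Hive's implementation; `merge_join` and the sample rows are made up.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs that are already sorted by key."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1  # left key has no partner yet; advance left
        elif lkey > rkey:
            j += 1  # right key has no partner yet; advance right
        else:
            # Find the run of rows on the right that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Pair every left row with this key with every right row in the run.
            while i < len(left) and left[i][0] == lkey:
                for k in range(j, j_end):
                    result.append((lkey, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return result


bucket_a = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
bucket_b = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
print(merge_join(bucket_a, bucket_b))
```

Each input is scanned once and nothing has to be shuffled or held in a hash table, which is where the speedup over a shuffle join comes from.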

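Reference: https://coding-factory.tistory.com/757

What is the difference between the hive command and the beeline command?

- The hive command can set Hive options, run queries, and so on by itself

- beeline is only a client interface that connects to Hive (hiveserver2) over JDBC, which runs on top of Thrift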

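< thrift >

A framework for easily combining (communicating between) software written in different languages. It uses RPC internally.

It helps services written in different programming languages communicate efficiently

(the client and the server can be implemented in different languages).

It focuses on cross-language interoperability and high-performance service calls.

It is better suited to service-to-service communication than to database connections.

HiveServer2 uses a Thrift API to communicate with clients.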

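< JDBC >

- Java Database Connectivity

- The standard database connectivity API for the Java language only

- An interface (API) for accessing a DB and running SQL from Java

- You need to install the JDBC driver that matches the DBMS you connect to from Java

Each database vendor (Oracle, MySQL, PostgreSQL, SQL Server, etc.) provides a JDBC driver specific to its database, and that driver implements the JDBC interfaces.

That is, each DB product ships its own driver that supports JDBC (an administrator has to download and install it).

However, you do not have to change how you connect depending on each DB's driver.

The client (a Java application) just writes its DB connection code against the standard JDBC interfaces.

When the connection is actually made, the DB's JDBC driver takes over and translates the calls into the database's own network protocol.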

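< ODBC >

- Open Database Connectivity

- An interface for accessing various DBs and running SQL from an application

- The application does not need to care which DBMS manages the database it connects to

Whatever programming language the application is written in (as long as the language can bind to C),

ODBC provides a common way to access various database management systems (DBMS: MySQL, PostgreSQL, and so on)

(it covers a wider range of languages than JDBC).

(As with JDBC) each database vendor provides an ODBC driver specific to its database.

(The administrator has to install the ODBC driver for the target database (MySQL, PostgreSQL, Hive) on the operating system beforehand.)

(As with JDBC) the client (application) writes its DB connection code with the standard ODBC API functions (interface).

When the connection is actually made, the application's connect call first goes to the ODBC driver manager.
The driver manager parses the request and works out which ODBC driver is needed.

It then loads that driver and forwards the application's request to it.
The database's ODBC driver receives the connection request from the driver manager,

translates it into that database's own network protocol, and talks to the actual database.

The developer just uses the standard ODBC API.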

Why ORC is used with Hive

- High compression ratio

- It is a columnar format, so processing is fast

- It carries its own schema

- ORC is commonly said to be best supported in Hive, and Parquet in Spark

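ORDER BY vs DISTRIBUTE BY vs SORT BY vs CLUSTER BY

ORDER BY

- Sends all mapper output to a single reducer, which sorts it

- Because there is only one reducer, there is only one output file

- Adding a limit is a good way to reduce the load on the reducer

- order by COLUMN

SORT BY

- Unlike ORDER BY, there can be more than one reducer, and each reducer sorts its own data

- Because sorting is per reducer, the combined output of all reducers is not totally ordered

- Setting the number of reducers to 1 gives the same result as ORDER BY

- sort by COLUMN

DISTRIBUTE BY

- Rows are grouped by the value of the DISTRIBUTE BY column, and each group goes to a single reducer

- No sorting is performed

- Example)

Input : a a b c c c d

Reducer 1) a d a

Reducer 2) c b c c

(Regardless of the number of reducers) all rows with the same value end up in one reducer.

CLUSTER BY :

- Combines DISTRIBUTE BY and SORT BY on the same column

- That is, it distributes rows as DISTRIBUTE BY does and also sorts within each reducer

- Example)

Input : a a b c c c d

Reducer 1) a a d

Reducer 2) b c c c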
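
The four clauses can be imitated with plain lists. This is a toy simulation, not Hive code: the reducer for a value is chosen with a CRC32 hash (an arbitrary stand-in for Hive's partitioner), so which value lands in which reducer will differ from the examples above, but the guarantees are the same.

```python
import zlib


def reducer_for(value, num_reducers):
    """Deterministically pick a reducer index for a value (stand-in partitioner)."""
    return zlib.crc32(value.encode("utf-8")) % num_reducers


def distribute_by(values, num_reducers):
    """DISTRIBUTE BY: equal values go to the same reducer; no sorting."""
    reducers = [[] for _ in range(num_reducers)]
    for v in values:
        reducers[reducer_for(v, num_reducers)].append(v)
    return reducers


def sort_by(reducers):
    """SORT BY: each reducer sorts only its own rows."""
    return [sorted(rows) for rows in reducers]


def cluster_by(values, num_reducers):
    """CLUSTER BY: DISTRIBUTE BY followed by SORT BY on the same column."""
    return sort_by(distribute_by(values, num_reducers))


def order_by(values):
    """ORDER BY: a single reducer produces one totally ordered output."""
    return sorted(values)


data = ["a", "c", "b", "c", "a", "d", "c"]
print(distribute_by(data, 2))
print(cluster_by(data, 2))
print(order_by(data))
```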

Reference) https://saurzcode.in/2015/01/hive-sort-order-distribute-cluster/

Hive static partitions vs dynamic partitions

Static partition

- The partition value is given explicitly in the statement

Example)

INSERT INTO TABLE tbl PARTITION (yymmdd='20220625')
SELECT name FROM temp;

> hdfs://hive/tables/yymmdd=20220625/

Dynamic partition

- Partitions are created from the values of a data column

- At query time it is not known which row will go to which partition

- Dynamic partitioning tends to be slower

Example)

INSERT INTO TABLE tbl PARTITION (yymmdd)
SELECT name, yymmdd FROM temp;

> hdfs://hive/tables/yymmdd=20220625/

> hdfs://hive/tables/yymmdd=20220626/

> hdfs://hive/tables/yymmdd=__HIVE_DEFAULT_PARTITION__/
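
A sketch of what the dynamic-partition insert does conceptually: each row's partition column value decides its target directory, and rows with a missing (NULL or empty) value fall into `__HIVE_DEFAULT_PARTITION__`. The helper name and the sample rows are made up; the base path is the one used in the example above.

```python
from collections import defaultdict

DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"


def dynamic_partition_dirs(rows, partition_column, table_dir):
    """Group rows by target partition directory, as a dynamic-partition insert would."""
    dirs = defaultdict(list)
    for row in rows:
        value = row.get(partition_column)
        # Rows without a usable partition value go to the default partition.
        label = DEFAULT_PARTITION if value in (None, "") else str(value)
        # The partition column is encoded in the path, not stored in the data file.
        rest = {k: v for k, v in row.items() if k != partition_column}
        dirs[f"{table_dir}/{partition_column}={label}"].append(rest)
    return dict(dirs)


temp = [
    {"name": "kim", "yymmdd": "20220625"},
    {"name": "lee", "yymmdd": "20220626"},
    {"name": "park", "yymmdd": None},
]
for path, rows in dynamic_partition_dirs(temp, "yymmdd", "hdfs://hive/tables").items():
    print(path, rows)
```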

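Merging small files in HDFS

https://gyuhoonk.github.io/hive-merge-query

Summary

Many small files mean more HDFS I/O and more pressure on NameNode (NN) memory.

Hive

- Queries such as insert into run mappers only,

and each mapper writes what it read straight to HDFS as its own file, so the number of files (and blocks) grows.

To merge them, add a reducer stage (for example by adding sort by).

Spark

- The default number of shuffle partitions is 200, so after a shuffle 200 small partitions are produced and written to HDFS.

To merge them, tune the shuffle partition setting, or use repartition or coalesce to reduce the number of partitions.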
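
One simple way to choose how many output files (or partitions) to aim for is to divide the data size by a target file size. The sketch below assumes a target of 128 MiB, which matches the common default HDFS block size; the right target for a given cluster is a judgment call, and `target_file_count` is a hypothetical helper.

```python
import math

DEFAULT_TARGET_BYTES = 128 * 1024 * 1024  # 128 MiB, assumed target file size


def target_file_count(total_bytes, target_file_bytes=DEFAULT_TARGET_BYTES):
    """Number of files needed so that each file is about target_file_bytes."""
    # Always write at least one file, even for tiny or empty outputs.
    return max(1, math.ceil(total_bytes / target_file_bytes))


ten_gib = 10 * 1024 ** 3
print(target_file_count(ten_gib))  # 10 GiB at 128 MiB per file
```

The resulting number is the kind of value one would pass to repartition or coalesce, or use as the shuffle partition setting, instead of leaving the default of 200.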

< Presto >

Presto, like Hive, is a SQL query processing system for large-scale data.
Like Hive, it lets you query data stored in external storage (HDFS, S3).

However, its purpose and the way it works differ from Hive's:

- Purpose
  - Hive : batch query processing
  - Presto : interactive, low-latency analytical queries
- Execution engine
  - Hive : MR, Tez, Spark
  - Presto : its own engine (MPP: massively parallel processing)
- Processing model
  - Hive : disk-based
  - Presto : in-memory
- Speed
  - Hive : slow, since it is built for processing large volumes
  - Presto : fast (...)
- Data sources
  - Hive : files in HDFS (ORC, Parquet)
  - Presto : HDFS files (ORC, Parquet) plus MySQL, Kafka, Cassandra, and many others
- Use cases
  - Hive : large-scale ETL, OLAP in a data warehouse
  - Presto : BI, near-real-time dashboards, environments where query latency matters, interactive analysis

Presto is a poor fit for large-scale ETL or batch processing.
Use Hive for those, and Presto for interactive analysis.

Hive and Presto are also used together.
For example, Hive loads large volumes of data into HDFS,
and Presto is used to query it.

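< Iceberg >

Apache Iceberg is a table format (not a data file format) designed for managing tables efficiently in large data lakes.
Iceberg lets you use a data lake like a data warehouse.
It is optimized for managing large volumes of data reliably and querying them quickly on HDFS, S3, and other storage.

Iceberg's advantages:
- Optimized for very large data (tens to hundreds of petabytes)
- ACID transaction support (guarantees data consistency)
(ACID : Atomicity, Consistency, Isolation, Durability)
- Schema evolution (the schema can change)
- Partition evolution (the partitioning can change)
- Time travel (past data can be queried)
- Works with many engines (Hive, Spark, Trino, Flink, Presto)

Data stored in an Iceberg table is kept in formats such as Parquet and ORC.
Once written, these data files are never modified (an update or delete does not even remove the old file right away).
Updates and deletes are handled (logically) by writing new files.
For the data added to a table, Iceberg also keeps metadata ready in advance (data file paths, partition information, per-column statistics (min/max values, null counts, etc.), delete information, and so on).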

Let's walk through an example.

Say I have just created an Iceberg table.
The table has no data files yet.

An initial table metadata file (e.g. v1.metadata.json) is created under /user/hive/warehouse/my_sales/metadata/.
This file points to an empty initial snapshot.

/user/hive/warehouse/my_sales/
└── metadata/
    └── v1.metadata.json <- the snapshot representing 'the current, empty table'

The .current pointer now points to v1.metadata.json.
A select query returns nothing, because (as of the v1 snapshot) there is no data.

I insert some data (in Parquet format) with insert into.
It is stored as part-00000-abc.parquet.

/user/hive/warehouse/my_sales/
├── data/
│   └── year=2023/month=01/day=01/ <- the partition directory is created automatically (from the partitioning I defined)
│       └── part-00000-abc.parquet <- the data I stored
└── metadata/
    ├── v1.metadata.json
    ├── snap-1-m0.avro    <- a manifest file holding the metadata (path, statistics, etc.) of the data I stored (part-00000-abc.parquet)
    ├── snap-1-ml.avro    <- a manifest list that records the manifest file (snap-1-m0.avro)
    └── v2.metadata.json    <- a new snapshot representing 'the table with the one data file I stored'

The .current pointer now points to v2.metadata.json.
A select query returns the data in part-00000-abc.parquet (as of the v2 snapshot).

I run a query that deletes the data I just inserted (delete from MYTABLE where id = 1;)

/user/hive/warehouse/my_sales/
├── data/
│   └── year=2023/month=01/day=01/
│       ├── part-00000-abc.parquet <- the actual data is not deleted
│       └── deleted-rows-xyz.avro <- a delete file marking the row as logically deleted (it says something like "record id=1 was deleted")
└── metadata/
    ├── v1.metadata.json
    ├── snap-1-m0.avro
    ├── snap-1-ml.avro
    ├── v2.metadata.json
    ├── snap-2-m0.avro             (Manifest File: lists part-00000-abc.parquet AND deleted-rows-xyz.avro)
    ├── snap-2-ml.avro             (Manifest List: lists snap-2-m0.avro)
    └── v3.metadata.json          <- a new snapshot representing 'the table after the delete'

Iceberg follows the latest snapshot (v3) and still reads part-00000-abc.parquet,
but it consults deleted-rows-xyz.avro and leaves the id=1 record out of the final result.
From the user's point of view, the query result simply shows that the data is gone.

Now consider an update.
When a query such as UPDATE my_sales SET amount = 120 WHERE id = 1; runs,
Iceberg treats the update as 'delete the old row' + 'add a new row'.
That is, it deletes the id=1 row (as delete from does) and adds a new row with id=1, amount=120.

/user/hive/warehouse/my_sales/
├── data/
│   └── year=2023/month=01/day=01/
│       ├── part-00000-abc.parquet <- still hanging on
│       ├── deleted-rows-xyz.avro <- the trace of the delete is still here too
│       └── part-00001-def.parquet <- the new data added by the update
└── metadata/
    ├── ... (previous metadata files)
    ├── snap-3-m0.avro             (Manifest File: lists old valid, new delete, new insert files)
    ├── snap-3-ml.avro
    └── v4.metadata.json           <- a new snapshot representing 'the table including the updated data'

The .current pointer now points to the v4 snapshot.
Users query the table as of this snapshot,
so they can select the data newly added by the update.

Schema changes such as adding, dropping, or renaming columns are done safely by updating only the metadata, without touching the existing (immutable) data files.
Because the old data files remain, a past snapshot can be used to query the data as of a specific point in time.
When something goes wrong, the table can also be rolled back easily to an earlier snapshot.
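
The snapshot idea can be modeled in a few lines. This is a toy, in-memory model and not the Iceberg library or its file layout: data files are immutable tuples, deletes are recorded as (file name, row id) marks, and every change appends a new snapshot. `ToyIcebergTable` and its methods are invented for this sketch.

```python
class ToyIcebergTable:
    """Toy model: immutable data files plus a growing list of snapshots."""

    def __init__(self):
        self.files = {}  # file name -> tuple of rows; never modified once written
        # Version 1 is the empty table, like v1.metadata.json in the walkthrough.
        self.snapshots = [{"data_files": (), "deletes": frozenset()}]

    def _commit(self, data_files, deletes):
        # Every change appends a new snapshot; older snapshots stay untouched.
        self.snapshots.append({"data_files": tuple(data_files), "deletes": frozenset(deletes)})

    def _delete_marks(self, row_id):
        # Mark matching rows in the files visible in the current snapshot.
        current = self.snapshots[-1]
        return {
            (name, row["id"])
            for name in current["data_files"]
            for row in self.files[name]
            if row["id"] == row_id
        }

    def insert(self, file_name, rows):
        self.files[file_name] = tuple(dict(r) for r in rows)
        current = self.snapshots[-1]
        self._commit(current["data_files"] + (file_name,), current["deletes"])

    def delete(self, row_id):
        current = self.snapshots[-1]
        self._commit(current["data_files"], current["deletes"] | self._delete_marks(row_id))

    def update(self, row_id, file_name, new_row):
        # An update is "mark the old row deleted" plus "write a new file", in one commit.
        current = self.snapshots[-1]
        marks = self._delete_marks(row_id)
        self.files[file_name] = (dict(new_row),)
        self._commit(current["data_files"] + (file_name,), current["deletes"] | marks)

    def scan(self, version=None):
        """Read the table as of a snapshot version (1-based); default is the latest."""
        snap = self.snapshots[-1] if version is None else self.snapshots[version - 1]
        return [
            dict(row)
            for name in snap["data_files"]
            for row in self.files[name]
            if (name, row["id"]) not in snap["deletes"]
        ]


table = ToyIcebergTable()
table.insert("part-00000-abc.parquet", [{"id": 1, "amount": 100}, {"id": 2, "amount": 50}])
table.update(1, "part-00001-def.parquet", {"id": 1, "amount": 120})
print(table.scan())           # latest snapshot: the updated row is visible
print(table.scan(version=2))  # time travel: the table right after the insert
```

Rolling back amounts to pointing "current" at an older entry of `snapshots`, which works only because the old files were never rewritten.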

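The system that writes to an Iceberg table (Spark, Hive, etc.) is the one that adds the Iceberg snapshots and manifests.
For example, suppose Spark writes data to an Iceberg table.

Once every task in the executors has finished writing its data files / delete files,
the Spark driver collects the metadata for the newly created files (paths, statistics, partition information, etc.).
From that information it creates new manifest files and a new snapshot and stores them under /metadata/.

How does the Spark driver know how to create Iceberg manifests and snapshots?
Because Spark uses the core library that Apache Iceberg provides.
That library is what knows how to create the manifests and snapshots.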

Studying English for development work

눈가락 2025. 2. 9. 22:39

https://spark.apache.org/docs/latest/index.html

- pandas API on Spark for pandas workloads

- Downloads are pre-packaged for a handful of popular Hadoop versions

- Spark runs on both Windows and UNIX-like systems, and it should run on any platform that runs a supported version of Java

- it is necessary for applications to use the same version of Scala that Spark was compiled for

For example, when using Scala 2.13, use Spark compiled for 2.13

- use this class in the top-level Spark directory.

- with this approach, each application is given a maximum amount of resources it can use

and holds onto them for its whole duration.

- Resource allocation can be configured as follows, based on the cluster type.

- At a high level, Spark should relinquish executors when they are no longer used and acquire when they are needed.

- We need a set of heuristics to determine when to remove and request executors.

- By default, Spark's scheduler runs jobs in FIFO fashion.

- If the jobs at the head of the queue don't need to use the whole cluster,

later jobs can start to run right away, but if the jobs at the head of the queue are large,

then later jobs may be delayed significantly.

- Under fair sharing, Spark assigns tasks between jobs in a "round robin" fashion,

so that all jobs get a roughly equal share of cluster resources.

- This feature is disabled by default and available on all coarse-grained cluster managers.

- Without any intervention, newly submitted jobs go into a default pool

- This is done as follows

- This setting is per-thread to make it easy to have a thread run multiple jobs on behalf of the same user.

- If you would like to clear the pool that a thread is associated with, simply call this.

- jobs run in FIFO order.

- each user's queries will run in order instead of later queries taking resources from that user's earlier ones.

- At a high level, every Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster.

- ...the cluster that can be operated on in parallel.

- This guide shows each of these features in each of Spark's supported languages.

- it's easiest to follow along with if you launch Spark's interactive shell.

- It is not only Value but also Pointer, both of these together make up the node.

- We do it by just having the next value of A node be the B node.

- the same is true of the C node.

- if you look at how we're going to have to traverse this, we are going to have to start at head.

- that's what we are going to do down here with this print statement.

- the syntax is a little bit different than if you are going to use dictionaries.

- Our managers deal with all kinds of clients every day. So I can say that we maintain the highest level of service.

- In my work I follow the best practices to maintain clean and easy-to-understand Python code.

- In my previous project I carried out the responsibilities of both the Project Manager and the Team Leader.

carry out : to do something, to perform

both A and B

- As a QA specialist I worked with a test environment where I tested many aspects of the platform

to ensure that it works as desired.

- A cloud architect oversees application architecture

and deploys it in cloud environments like public cloud, private cloud and hybrid cloud.

to oversee : to watch over and control something to make sure that the work is good or satisfactory, to supervise

- I took a course where I learned how to design and write programs that are easy to maintain.

to design : to create, draw, or construct something

- I will set up all the necessary equipment in my home office to work remotely on this project

- I'm an IT Technician, so I install and configure different software on all computers in the office.

to install : to put a new program or piece of software into a computer

to configure : to change the settings of software on a computer

- As a Jr Software Engineer, I assist and participate in the research, design, development, and testing of software and tools.

to assist : to help someone or something

- I am a web designer, so I know how to provide the best UX for your website visitors.

- Project managers usually estimate new projects by analogy, using previous projects and past experience.

to estimate : to give a general idea of the cost of work or the time you need to do the work

analogy : a comparison of two things based on their being alike in some way

- Sometimes I need to google my questions, for ex "how to execute the code inside of function in JS"

- Working on my project I improved my time management and organizational skills.

- press the F2 key on your keyboard

- The screen resolution is 1366x768

- I prefer to work with a desktop.

- Workstation PCs have multiple processor cores.

- Some tablets have a long battery life

- The volume on my speakers won't turn up.

- My printer broke down, so I printed out these documents at work.

- 'RAM' is uncountable, so you can only say 'RAM is' or 'RAM was', not 'RAMs' or 'RAMs are'

- with a cable : wired mouse, wired connection(Ethernet)

- without a cable : wireless mouse, wireless connection(WiFi)

- ISP stands for Internet Service Provider.

- so many folders on my desktop

- start or shut down a computer.

- turn on or turn off a computer

- to crash / to freeze up : when a computer suddenly stops working

- to look up a word or address : to find something.

we can use the 'nslookup' command in a terminal to query a DNS server.

- It will take about two hours to key in all this data.

to key in : to enter info into computer

- a shortcut key : a key or key combination that runs a command quickly.

Use Ctrl + L shortcut to see the last saved version.

- 'perform' is used a lot more than I thought. For ex, the server performs instructions written in code.

- use only the numbers in the given array.

- I assigned the number 33 to the age variable.

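- fraction : a number written as one integer over another (e.g. 1/3)

- numerator : the top number of a fraction

- denominator : the bottom number of a fraction

ex, 1/3 : one third, 2/3 : two thirds, 1/2 : a half, 1/4 : a quarter, 3/4 : three quarters

- decimal : a number with a fractional part written after a point

- decimal point : the dot that separates the whole part from the fractional part

- floating point : a way of representing real numbers in which the position of the point is not fixed

ex, 1.23 : one point two three, 15.1 : fifteen point one

'double' is also a floating-point type, but 'float' and 'double' are different from each other.

Let's look up how their representations differ.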
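
A quick way to see the difference, assuming the usual IEEE 754 sizes (a 4-byte 'float' and an 8-byte 'double', as in Java or C): Python's own float is a double, so the sketch uses the standard `struct` module to squeeze a value through 4 bytes and back. `as_float32` is a helper made up for this note.

```python
import struct


def as_float32(x):
    """Round a Python float (a double) to the nearest 32-bit float and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Size in bytes of a float and a double.
print(struct.calcsize("<f"), struct.calcsize("<d"))

# 0.1 cannot be stored exactly in either type; the 4-byte float is less precise.
print(as_float32(0.1))
print(0.1)

# 0.5 is a power of two, so both types store it exactly.
print(as_float32(0.5) == 0.5)
```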

- I created an array of strings.

- To debug is to investigate the program and fix bugs.

- A comment is text written around code that is ignored by the computer.

It is used for writing extra info about your code to help you understand it later.

So we can say, 'leave comments in your code.'

- 'Comment out' is to turn a piece of code into a comment with the help of special characters.

like, //, #, -- ... etc

you can comment out some lines to see how it works without them.

- A constant is a variable that never changes its value.

for ex, val a = 1, final int a = 1

we can say "In Java, a constant is assigned using the final keyword"

"the PI constant has the value of 3.14"

- If you try to divide a number by zero, your program will crash.

A program crashes when it stops running because of an error.

- An 'executable' is a program which is ready to be run.

Short for executable file, executable program

A common filename extension .exe means that it is an executable file.

It sounds like 'EK-si-kyoo-tuh-bul'.

- To declare in programming means to say that something exists

usually a variable, a function, or a class.

I've only declared a function, but I haven't written it yet.

- To implement means to write and complete something in code

for example, to implement a function or a class

I declared a function and implemented it. It works well!

- To instantiate means to create an object from a class.

I instantiated another object of the Student class.

It sounds like 'in-STAN-shee-ate'.

- A loop is a piece of code that runs itself many times.

It can also be used as a verb - to loop or to iterate

I used a "for" loop to run this code for every value in the array.

I iterate through every element in the list.

- He read some data values from another source over the internet.

- Syntax is the grammatical rules of a programming language.

Syntax determines if code is written correctly or not.

- Look for typing mistakes if you get a syntax error.

Let's learn what various 'symbols' are called.

[ ~ ] : Tilde

[ ` ] : Backtick, Grave accent

[ ! ] : Exclamation mark

[ ? ] : Question mark

[ @ ] : At symbol

[ # ] : Number or Hash

[ $ ] : Dollar sign

[ ^ ] : Caret

[ & ] : Ampersand

[ * ] : Asterisk

[ () ] : Brackets, Parentheses

[ ( ] : Open bracket, Left bracket, Open Parenthesis, Left Parenthesis

[ ) ] : Close bracket, Right bracket, Close Parenthesis, Right Parenthesis

[ {} ] : Curly braces

[ { ] : Open curly brace, Left curly brace
[ } ] : Close curly brace, Right curly brace
[ [] ] : Square brackets
[ [ ] : Open square bracket, Left square bracket
[ ] ] : Close square bracket, Right square bracket
[ _ ] : Underscore, Horizontal bar

[ - ] : Dash, Hyphen

[ = ] : Equals

[ | ] : Vertical bar, 'Or'
[ / ] : Forward slash
[ \ ] : Back slash
[ : ] : Colon

[ ; ] : Semicolon

[ " ] : Quote, Double quote

[ ' ] : Apostrophe, Single quote

[ < ] : Less than

[ > ] : More than, Greater than

[ . ] : Dot, Period

- top brass : top managers in the company

The top brass from the USA want to see how we work here.

- hamster wheel : a series of company meetings

I thought this would be a productive day, but we ended up with a hamster wheel of pointless meetings.

- seagull : a manager who asks too many questions and gives advice too often

Warning, the seagull is coming! Wonder, what he would say this time...

- blamestorming : when the team members try to find who is responsible for a certain problem.

Guys, let's stop this blamestorming and think how we can solve this problem!

- space out : to stare at your screen to pretend you are working.

I need to stop spacing out and get down to work...!

- Product-based companies : Companies that work on their own products and sell them to end users.

- Service-based companies : Companies that provide different types of IT services to business clients.

- IT Consulting companies : Companies that deal with the implementation of ready-made software.

- Outsourcing (outside-resource-using) : when a company hires another company to do a certain job

e.g. software development, software support etc.

ex ) The company outsourced web-development to us.

- B2C : Business to consumer - company sells directly to individual clients.

- B2B : Business to business - company provides services or products to other businesses.

- SME : Small and Medium-sized Enterprises

- corporation, MNC(Multinational Corporation) : a big company that operates in two or more countries.

the opposite of start-up

- social enterprise : a business that tries to reach certain social goals apart from making profit.

They usually care about the environment.

- be based in : = be located at/in

ex) The company is based in San Francisco.

ex) The company is located at IQ Business Center.

- to specialize in : your company's field

ex) Our company specializes in Data Engineering.

- to develop, to deliver, to offer : to provide

ex) We offer digital consulting services.

ex) Our company delivers full-cycle software development services.

- target

ex) Teenagers are the target audience for our app!

ex) Our app targets college-aged adults.

- subsidiary, daughter company : a company that is owned or controlled by another larger company.

ex) After a merger in 2019, our company became a subsidiary of EDB group.

- SDLC : Software Development Life Cycle which is a process of software creation

It can consist of several stages such as 'Planning', 'Designing', 'Development', 'Testing', 'Deployment', 'Maintenance', etc.

- Phase 1. Requirements collection (Planning)

- Business requirements are gathered and documented.

- Major stakeholders give their input (stakeholder : people or groups who have an interest in or are affected by a decision, project, or organization.)

- Project scope is outlined, budget, resources, deadlines, and potential risks

and quality assurance requirements are defined.

(Project scope : all aspects of a project, including all activities, resources, etc

to outline : to describe something in a general way without giving too many details)

- These are involved : Business analyst, Subject matter expert, Major stakeholders, PM

- Phase 2. Design

- Software development requirements are translated into design.

- The entire system and its elements need to be designed (including high-level design and low-level design)

(high-level design (HLD) : the system's architectural design. general picture.

low-level design (LLD) : the design of its components; a detailed description of all components, configs, and processes)

- This stage includes the design of user interfaces, system interfaces, network, and network requirements, DBs.

- Operation, training, and maintenance plans are drawn up so that developers know what they need to do throughout every stage of the cycle.

(to draw up : to prepare a draft of something)

- Phase 3. Development

- Using the design document, software developers write code for all the components.

- Program code is built per the design document specifications.

(per : according to

specification / technical specification (tech spec) = a document that explains what a product will do and how you will achieve these goals)

- Every developer has to stick to the agreed blueprint.

(to stick to something : to keep doing a particular thing and not change to anything else, to follow the specification

blueprint : a detailed plan of how to do something)

- Developers utilize different tools, for example compilers, debuggers, and interpreters.

- The tasks are divided among the team members according to their area of specialization (front-end, back-end, DB administration etc)

- it's the most time-consuming phase. (time-consuming : using or taking up a lot of time)

- The result of this phase is a working software product.

- Phase 4. Testing

- The goal(or objective) is to ensure the software meets requirements.

- This is where the Quality Assurance (QA) team steps in to test the software.

(steps in : to become involved, start doing something on the project)

- All the modules of the software are brought together into a special testing environment and tested for errors and interoperability.

(bring together : assemble, collect, compile,

interoperability : an ability of one system or application to interact with another system or application)

- Software developers fix any bugs that come up during this stage. Then QA specialists test the software or its components again.

- All the defects are tracked, fixed, and retested.

- There are different kinds of testing: Functional testing, Performance testing, Unit testing, Integration testing, Regression testing etc.

- QC(Quality Control) is a set of activities designed to evaluate the quality of a component or system.

- Phase 5. Deployment

- The product is deployed in the production environment.

(to deploy : to make a software system available for use)

- If the customer wishes, UAT (User Acceptance Testing) is done before deployment.

For UAT, a replica of the production environment is created and the customer company does the testing!

(replica : an exact copy)

- Once they check that the product works as expected, they give a sign off to go live.

(go live : the point at which code moves from the test env to the prod env, therefore becomes available for end users)

- The customer may also come up with changes or enhancements to the software behavior.

These changes are called change requests.

- After they are done, the product is released to the market or deployed in the company's production environment.

(release : the distribution of the final version of an application)

- Phase 6. Maintenance (support)

- During this, the system is assessed to ensure it doesn't become obsolete(out-of-date, old-fashioned).

(to assess : to check and decide about the quality of something,

assessment : the process of checking and considering all the information about something; making a judgement,

obsolete : that is not in use anymore and has to be replaced by something newer and better)

- If any issue comes up and needs to be fixed, or any enhancement needs to be made, developers take care of that.

- This is also where changes can be made to initial software.

If you learn how to code in Java, you can choose from hundreds of jobs on the market.

Apps for Android OS are built on Java.

Almost all of the apps you use on your Android phone run on Java.

Around 80% of the world's largest websites use back-end web apps built with Java (with the help of Java).

A framework is a collection of languages, libraries, and utilities designed to help developers build applications.

Spring is a web application framework with clear and elegant syntax.

Utility is a small program that provides an addition to the capabilities provided by the OS

Syntax is rules that define the structure of a language.

On one hand Django ensures rapid development, fast processing, and scalability,

whereas on the other hand it has a monolithic nature and is not suitable for smaller projects.

Monolithic is composed all in one piece.

Scalability is the property of a system to handle a growing amount of work by adding resources to the system.

Processing is manipulation of data by a computer. e.g. conversion of raw data into machine-readable form.

APIs are applications that help you connect to different tools.

Those tools make up your extended tech stack.

API stands for Application Programming Interface.

This category includes servers, content distribution networks, routing and caching services that let your applications send and receive requests, run smoothly, and scale capacity as needed.

Routing is the process of selecting a path for traffic in a network or across multiple networks.

Request-Response is one of the basic methods that computers use to communicate with each other in a network. The first computer sends a request for some data and the second responds to the request.

This layer of the stack consists of relational and non-relational databases, data warehouses, and data pipelines that allow you to store and query all of your real-time and historical data.

Query is a request for data from a database table or combination of tables. This data may be generated as results returned by SQL.

BI tools bring together data gathered from multiple parts of the company and the market, and are designed to help track company performance and make higher-level business decisions.

Track is to record the progress or development of something over a period.

A full-stack developer can work well with a variety of languages as well as frameworks and can quickly learn something new.

Full-stack developers usually have skills in a lot of different niches, from databases etc.

Maintainability : It should be stable when the changes are made. It's easy to maintain the code and add amendments.

Compatibility : the software is compatible with several components.

Reliability : it's defined as the capability of the software to perform under specific conditions for a specified duration.

If you move your mouse over the picture, you can see the hint!

Pete, you cannot work like this! You need a dedicated work space (WFH setup)

If you don't have a stable Internet connection, it may take long to load some pages.

Internet outage : no internet connection, Internet is down

ex) I had an internet outage during a meeting yesterday

Power outage : no electricity

BYOD : bring your own device

COBO : company owned, business only

COPE : company owned, personally enabled

My employer provided me with a laptop and all the software I need was pre-loaded.

distributed team : members of a team work from different locations.

ex) Our company is headquartered in San Francisco with a distributed team across 5 countries.

hybrid team : some members are fully remote, others may come to the office.

all-remote company : A company that doesn't have offices at all and all employees work from home.

to work flextime : Flexible, you can change the time you start and finish work.

to maintain regular hours : to start and finish at the same time.

ex) I prefer to maintain regular hours of work, otherwise it's very hard to get things done!

You have to track time you spent working in order to be paid overtime.

I need to log hours at the end of every week.

Conference call : a telephone call in which people in different places can ALL talk to each other.

(=to be in a call, to have a call)

Sorry, I'm in a meeting.

I was having a meeting so I missed a delivery man.

We are having a conference call with the client where we'll discuss possible solutions to this issue.

back-to-back meetings : meetings without a break

ex) I finished my first meeting at 10am and the second one started at 10am without a break.

ex) I had five back-to-back meetings today, I am exhausted.

to reschedule

to move a meeting up : to start earlier

to move a meeting back : to start later

ex) The meeting was rescheduled for Thursday.

ex) Sorry I have my English class at 9am, Let's move our meeting back an hour.

I will stop sharing my screen now, and we can go into detail after the break.

Could I jump in for a second? (means interruptions)

Let me clarify, what is the deadline for this task? (clarify = explain, elaborate. talk in more detail about something)

Tina, you are on mute. please unmute yourself and repeat what you were saying.

I can hear some background noises. If you are not speaking, please put yourself on mute!

I think there is a delay, that's why Peter answers late.

I got kicked out of the meeting. (I got disconnected)

Sorry, I have to jump to another meeting.

Could you speak up?

Speak closer to the mic, you are too quiet.

When I just joined the company, I was constantly overworking. This soon led to burnout.

Agenda : list of objectives, topics to discuss in a meeting.

ex) When you create an online meeting, please always put the agenda in the invitation as well.

ex) We have a number of important matters on the agenda.

Apologies : announcing that some people are absent

usually those people ask beforehand to give their apologies at a meeting that they cannot attend.

apologies for absence.

ex) Hi Jane, I won't be able to join the weekly team meeting today as I have a client meeting. plz give my apologies.

ex) I have received apologies for the absence of Peter. he is on a sick leave.

Chairperson/ chair : the person who leads a meeting

ex) As chair, I want to take a moment to thank everybody for participating and sharing your thoughts and ideas.

Minutes : a written documentation/record of what was said at a meeting. can be detailed or just in the form of bullet points.

ex) First of all, let's quickly review the meeting minutes from last week and see if we have any open issues.

ex) Let's go over the minutes from our last meeting.

Designate : assign, ask someone to do something

ex) Does anybody volunteer to take the minutes or shall I designate someone?

Formality : a procedure that has to be followed due to a rule

ex) I will schedule a weekly meeting and take care of all the formalities, so that the team can concentrate on their work.

Objectives : goals to accomplish, topics to discuss at the meeting (usually as points in the agenda)

ex) I'm happy that we covered all the objectives today within the designated time.

Show of hands : raised hands to express an opinion in a vote

ex) Let's decide if we need a short break with a show of hands. Please raise your hand if you are for or against it.

When a participant gets the invitation, they respond to it.

Accept : "Yes, I will attend!"

Tentatively accept : "I don't know yet"

Decline : "No, I won't attend"

When a participant receives an invitation, they can also forward the invitation to a colleague.

Forward : to send the original invitation to someone else.

In an online meeting there sometimes can be a so-called 'lobby'

When a person joins a meeting, they first get into the lobby.

it means, that they wait to be let into the actual meeting room.

The organizer of the meeting needs to admit participants to let them into the meeting room.

Before we move on, I think we need to look at how we can ensure that it will not happen during the next sprint.

Let's move on to the status of our ongoing projects.

Jane, would you like to kick off? (=start) Would you like to introduce the first item on today's agenda?

I'd like to hand over to Tomas(who is gonna tell....)

'hand over' means that you ask the person to speak about something, to introduce a topic, to give an opinion on something.

Okay, Martin. over to you.

'over to you' means you give them control of the discussion.

Tomas could you please comment on it? (You want Tomas to say or to add something on the topic that you are discussing)

In summary, we're going to do the following. We've decided on the following.

This is what we've agreed on. We will meet in a week and synchronize on the progress.

The meeting is adjourned. Thank you all for attending.

I guess that's all for today. Thanks for coming!

That's it for today. Have a nice rest of the day, everyone!

Some people think that rapid development of AI is dangerous.

In ML, computer systems utilize complex data to recognize patterns and make appropriate decisions.

Most IoT devices are Wi-Fi enabled, but bluetooth can also be used to transfer data to nearby devices.

Big data enables you to gather data from social media, web visits, call logs, and other sources to improve customer experience.

Big data can be stored in the cloud, on premises, or both.

Traditional data is measured in megabytes, gigabytes and terabytes, but big data is stored in petabytes and zettabytes.

Big data is used in different industries to identify patterns and trends, answer questions, gain insights into customers' preferences, and tackle problems.

A kafka topic is identified by its name

and a kafka topic supports any kind of message format

The sequence of messages is called a data stream.

Topics are split in partitions.

Each message gets an incremental id, called offset.

Kafka topics are immutable. Once data is written to a partition, it cannot be changed.

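Data is kept only for a limited time : "is kept" means it is retained

Order is guaranteed only within a partition (not across partitions) : "across" here means that ordering holds inside a single partition but not between different partitions

Each consumer within a group reads data from exclusive partitions. : "exclusive" here means each consumer reads from its own partitions, which are not shared with other consumers in the group

Each broker is identified with its ID. ("each" takes a singular noun and verb)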

In these examples, we choose to number brokers starting at 100

Over time, the kafka clients and CLI have been migrated to leverage the brokers as a connection endpoint instead of Zookeeper.


[Python3] Personal syntax study notes

눈가락 2025. 1. 22. 15:03


https://eyeballs.tistory.com/648

[IT] CS 면접 대비 Python 질문 모음

python3 : print("hi eyeballs!")
python2 : print "hi eyeballs!"

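Python is a dynamic language,
so you do not write a type when creating a variable.
Also, every value a variable refers to is an object,
and an object carries its type information, actual value, object ID, reference count, and so on.
That is why the type can be checked with something like type(123).

The following functions can be used to check a type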

type(1) # <class 'int' >
isinstance(1, int) # True
isinstance("hi!", int) # False

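Python is a strongly typed language.
That is, an object's value may change (if the object is mutable), but an object's type cannot.

In Python, a variable is a name that refers to an object.
In statically typed languages the variable itself has a type,
so the type has to be specified when the variable is declared,
but Python is dynamic, so the variable itself has no type
and no type needs to be given when assigning a value.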

a = 1
b = a
print(a) # 1
id(a) #9440320
print(b) # 1
id(b) #9440320

a = 2
print(a) # 2
id(a) # 9440355
print(b) # 1
id(b) # 9440320

Here the id of a changes,
because a = 2 rebinds a to a different immutable object (2).
b still refers to the original immutable object (1).

a = [1,2,3]
b = a
print(a) # [1,2,3]
print(b) # [1,2,3]

a[0] = 99
print(a) # [99,2,3]
print(b) # [99,2,3]

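In contrast, here b was updated together with a,
because a list is mutable and a and b refer to the same list object.

What matters is whether the object is mutated in place or the name is rebound.
If a had been assigned a new list instead, b would not have changed.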

a = [1,2,3]
b = a
a = [2,3,4]
print(a) # [2,3,4]
print(b) # [1,2,3]

a = "!"
b = "!"
id(a) #21926656
id(b) # 21926656

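Objects that are likely to be used often (here, the object holding "!") may be cached and reused by Python. This is string interning, a CPython implementation detail, so identical ids are not guaranteed.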

def func(p) : ...
func(1)
func([1,2,3])
func("eyeballs")

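What is passed into a function parameter is a reference to the object the argument refers to.
In def func(p), p becomes a local name bound to that reference.
This is called Call by Object Reference.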
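
A minimal sketch of what call by object reference means in practice. The function names mutate and rebind are made up for this illustration. Mutating the object through the parameter is visible to the caller, while rebinding the parameter only changes the local name.

```python
def mutate(p):
    p.append(99)      # changes the object the caller also refers to


def rebind(p):
    p = [0]           # rebinds only the local name p; the caller's list is untouched


nums = [1, 2, 3]
mutate(nums)
after_mutate = list(nums)   # snapshot after the in-place change
rebind(nums)
after_rebind = list(nums)   # unchanged by rebind
print(after_mutate, after_rebind)
```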

Any value other than 0, an empty value, or None is treated as True

bool(True) #True
bool(1) #True
bool(-1) #True
bool(0) #False
bool(0.0) #False
bool("") #False
bool(None) #False
bool(set()) #False
bool({}) #False
bool(()) #False
bool([]) #False

When writing numbers, digits can be grouped in threes with underscores
million = 1_000_000
print(million) # 1000000

It does not have to be groups of three; an underscore may go between any two digits
a = 1_2_3
print(a) # 123

print(1/2) # 0.5
print(1//2) # 0. floor division: the result is rounded down
print(1%2) # 1. returns the remainder

print(chr(65)) # 'A'
print(ord('A')) # 65

int can hold very large values. Even 10 to the 100th power (a googol) fits

print(int(10**100)) # 10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

a = b = c = 1
print(a) # 1
print(b) # 1
print(c) # 1

print(1<2<3) # True

a = 1
print(a == 1) # True
print(a != 1) # False
print((not a) == 1) # False

To check whether a value is contained in an object that holds other values, such as a string or a sequence, use in

print('a' in "abcde") # True
print("abc" in "abcde") # True
print('a' in ('a','b','c')) # True
print('a' in {'a':'a', 'b':'b'}) # True

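:= is called the walrus operator...
It evaluates an expression and assigns the result to a name in a single step

For example, the two snippets below print the same result

diff = 2 - 1
if diff >= 0 :
    print("+")
else :
    print("-")

Output : +

if (diff := 2 - 1) >= 0 : # the parentheses are needed; without them diff would be bound to the result of 2-1 >= 0, which is True
    print("+")
else :
    print("-")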
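
One pitfall worth a small sketch: := binds more loosely than comparison operators, so the parentheses decide what actually gets stored. The names with_parens and without_parens exist only for this illustration.

```python
# With parentheses, diff receives the arithmetic result.
if (diff := 2 - 1) >= 0:
    with_parens = diff

# Without parentheses, flag receives the result of the whole comparison.
if flag := 2 - 1 >= 0:
    without_parens = flag

print(with_parens, without_parens)
```

Both branches are taken, but the first name holds 1 and the second holds True.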

To disable escape characters, use a raw string (the r prefix)

print(r"a\nb\tc\\d\e") # a\nb\tc\\d\e

Adjacent string literals are concatenated automatically, so + is not needed

print("a" "b" "c") # abc

print("aba".replace('a', 'x')) # xbx
print("abcdefg"[::2]) # aceg. slicing with a step of 2
print("abcdefg"[::-1]) # gfedcba. slicing with a step of -1, which reads from the end and so reverses the string

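There are four methods for searching a string
find() : searches from the start; returns the offset (index) if found, -1 if not
index() : searches from the start; returns the offset (index) if found, raises ValueError if not
rfind() : searches from the end; returns the offset (index) if found, -1 if not
rindex() : searches from the end; returns the offset (index) if found, raises ValueError if not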
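
A short sketch of the difference in failure behaviour between the find family and the index family. The variable names are only for illustration.

```python
text = "abcba"

pos = text.find("x")           # a missing substring gives -1, no exception

try:
    text.index("x")            # a missing substring raises ValueError
    raised = False
except ValueError:
    raised = True

last_b = text.rindex("b")      # searches from the end
print(pos, raised, last_b)
```

find suits code that checks the return value, while index suits code where a missing substring should be treated as an error.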

"abcba".find("b") # 1
"abcba".find("x") # -1
"abcba".rfind("b") # 3

You can check whether a string consists of letters or digits

"123".isnumeric() #True. numeric characters only
"123".isdigit() #True. digits only
"abc".isalpha() # True. letters only
"abc123".isalnum() # True. letters and digits only

"..a..".strip('.') # 'a'
"..a..".lstrip('.') # 'a..'
"..a..".rstrip('.') # '..a'

format examples

"a{}{}d".format("b","c") # 'abcd'
"a{1}{0}d".format("b","c") # 'acbd'
"a{x}{y}d".format(x="b", y="c") # 'abcd'

f-string examples

a = "a"
b = "b"
f"x{a}{b}y" # "xaby"
f"x{a=}{b=}y" # "xa='a'b='b'y"

When a loop contains a break and something should run only if the break never fires, use else
That is, the else block runs when the loop finishes without hitting break

index = 0
while index < len([0,1,2,3]):
    index += 1
    if index > 5 : break
else : print("no break")

Output : no break

for n in (0,1,2) :
    if n < 0 : break
else : print("no break")

Output : no break

As shown above, else can be used as a break checker

Pairing (1,2,3) and ['a', 'b', 'c'] index by index
into [(1, 'a'), (2, 'b'), (3, 'c')] can be done with the built-in zip function

a = (1,2,3)
b = ['a', 'b', 'c']
z = zip(a, b)

type(z) # <class 'zip' >

print(list(z)) # [(1, 'a'), (2, 'b'), (3, 'c')]

for x, y in zip(a,b) :
    print(str(x)+", "+y) # x is an int, so it has to be converted before concatenating with a str

Output :
1, a
2, b
3, c

>>> a = (1,2,3)
>>> b = ['a','b','c']
>>> c = ([], (), {})
>>> z = zip(a,b,c)
>>> print(list(z))
[(1, 'a', []), (2, 'b', ()), (3, 'c', {})]

>>> a = (1,2,3)
>>> b = ['a','b','c']
>>> print( dict(zip(a,b)) )
{1: 'a', 2: 'b', 3: 'c'}

a = (1,2,3) # a has 3 items
b = ['a'] # b has 1 item, 2 fewer than a
z = zip(a, b)

print(list(z)) # [(1, 'a')]. zip stops at the shortest input, here b

A zip object is an iterator, so once its result has been consumed in any way,
it yields nothing afterwards

a = (1,2,3)
b = ['a', 'b', 'c']
z = zip(a,b)
print(list(z)) # [(1, 'a'), (2, 'b'), (3, 'c')]
print(list(z)) # []. z was already exhausted on the line above
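
Two related points in one small sketch: itertools.zip_longest pads the shorter input instead of truncating, and storing the result in a list lets it be reused. The names padded and pairs are only for illustration.

```python
from itertools import zip_longest

a = (1, 2, 3)
b = ["a"]

# zip_longest keeps going until the longest input ends, filling gaps with fillvalue.
padded = list(zip_longest(a, b, fillvalue=None))

# Materialising the zip once gives a list that can be read any number of times.
pairs = list(zip(a, b))
print(padded)
print(pairs, pairs)
```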

How to create a tuple

t = ()
t = "eyeballs",
t = ("eyeballs",)
t = 1, 2, 3
t = (1, 2, 3)
t = tuple([1,2,3])
t = (1, 2) + (3, 4)
t += t

A tuple can be used to assign values to several variables at once

a, b, c = (1,2,3)
a, b, c = 1, 2, 3
a, b = b, a

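There is also the named tuple.
It is a data structure whose values can be accessed both by name and by position.
It is a subclass of tuple and is available from the collections module.

Think of it as creating an immutable object with named fields.
It behaves like a tuple while letting you name the fields, which improves readability and maintainability.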

>>> from collections import namedtuple
>>> Person = namedtuple("Person", ["name", "age"])
>>> p1 = Person(name = "A", age = 30) # only two attributes here, but this gives an immutable, dict-like object
>>> p2 = Person(name = "B", age = 60)
>>> p1[0], p1[1]
('A', 30)
>>> p2.name, p2.age
('B', 60)

>>> p3 = Person._make(['C', 90]) # _make builds a namedtuple directly from a list
>>> p3[0], p3[1]
('C', 90)

>>> type(p3._asdict()) # _asdict converts the namedtuple to a dictionary
<class 'dict'>
>>> print(p3._asdict())
{'name': 'C', 'age': 90}

>>> p4 = p3._replace(age=10) # _replace creates a new namedtuple with some fields changed
>>> print(p4)
Person(name='C', age=10)
>>> id(p3)
68810088
>>> id(p4)
68794984 # p4 is a new object, distinct from p3. p3 is immutable, so a new object is created instead of modifying it

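Because fields are read by name, it can look more like a dictionary than a tuple,
but it still supports everything a tuple does: indexing, unpacking, iteration, and immutability.

A namedtuple is said to be more efficient than a dictionary. One reason is that the field names are stored once on the class rather than in every instance.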
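
A small sketch of the tuple-like side of a namedtuple, reusing the Person shape from the example above. It shows unpacking, the isinstance check, and what happens when a field is reassigned.

```python
from collections import namedtuple

Person = namedtuple("Person", ["name", "age"])
p = Person("A", 30)

name, age = p                    # unpacking works as with any tuple
is_tuple = isinstance(p, tuple)  # a namedtuple instance is a tuple

try:
    p.age = 31                   # fields cannot be reassigned
    mutable = True
except AttributeError:
    mutable = False

print(name, age, is_tuple, mutable)
```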

How to create a list

l = []
l = [1,2,3]
l = list()
l = list('eyeballs') # ['e','y','e','b','a','l','l','s']
l = list((1,2,3))
l = "a.b.c".split(".")
l += l

append adds its argument as a single item at the end of the list, so a list argument becomes one nested element
extend merges the items of its argument into the list

>>> a = [1,2,3]
>>> a.append([4,5])
>>> print(a) # [1, 2, 3, [4, 5]]

>>> a = [1,2,3]
>>> a.extend([4,5])
>>> print(a) # [1, 2, 3, 4, 5]

extend produces the same contents as the + operator, but extend modifies the list in place while + builds a new list

>>> a = [1,2,3]
>>> print(a + [4,5])
[1, 2, 3, 4, 5]
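
A brief sketch of how extend and + differ even though the resulting contents match. extend changes the existing list, so every name bound to it sees the change, while + returns a new list. The name alias is only for illustration.

```python
a = [1, 2, 3]
alias = a                  # second name for the same list

a.extend([4, 5])           # in place: alias sees the new items too
extended_in_place = alias == [1, 2, 3, 4, 5]

c = a + [6]                # builds a new list; a itself is not changed
print(extended_in_place, a, c, c is a)
```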

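Removing items from a list

a = [1,2,3]
del a[1] # deletes the item at index 1 (the number 2)
print(a) # [1,3]

a = [1,2,3]
a.remove(2) # deletes the first item equal to 2
print(a) # [1,3]

There is no a.del(1) syntax...

Removing an item from a list while also getting it back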

a = [1,2,3,4,5]
n = a.pop()
print(n) # 5
print(a) # [1,2,3,4]

a = [1,2,3,4,5]
n = a.pop(0)
print(n) # 1
print(a) # [2,3,4,5]

a = [1,2,3,4,5]
n = a.pop(1)
print(n) # 2
print(a) # [1,3,4,5]
n = a.pop(1)
print(n) # 3
print(a) # [1,4,5]

sort() sorts the list itself in place
sorted() returns a sorted copy of the list

s = [2,5,4,1,3]
result = s.sort()
print(result) # None
print(s) # [1,2,3,4,5]

s = [2,5,4,1,3]
result = sorted(s)
print(result) # [1,2,3,4,5]
print(s) # [2,5,4,1,3]

For descending order, add the argument reverse = True

s = [2,5,4,1,3]
s.sort(reverse = True) # sort() returns None, so there is nothing useful to assign
print(s) # [5,4,3,2,1]

a = [1,2,3]
b = a

Here a and b refer to the same list object
If the list is modified through a, b sees the modified object too

To put a copy of a into b, use one of the approaches below

a = [1,2,3]
b = a.copy()
a[0]=-99
print(a) # [-99, 2, 3]
print(b) # [1, 2, 3]

a = [1,2,3]
b = a[:]
a[0]=-99
print(a) # [-99, 2, 3]
print(b) # [1, 2, 3]

a = [1,2,3]
b = list(a)
a[0]=-99
print(a) # [-99, 2, 3]
print(b) # [1, 2, 3]

copy is a shallow copy. Nested lists are not copied; they are shared between the original and the copy

a = [1,2,[3,4,5]]
b = a.copy()
a[2][0]=-99
print(a) # [1, 2, [-99, 4, 5]]
print(b) # [1, 2, [-99, 4, 5]]
id(a[2]) # 30645800
id(b[2]) # 30645800

To properly copy the nested lists as well, use deepcopy

import copy
a = [1, 2, [3, 4, 5]]
b = copy.deepcopy(a)
a[2][0] = -99
print(a) # [1, 2, [-99, 4, 5]]
print(b) # [1, 2, [3, 4, 5]]
id(a[2]) # 62359848
id(b[2]) # 30659592

A list comprehension expresses a for loop in a single line

a = [1, 2, 3, 4, 5]
mylist = [i for i in a]
print(mylist) # [1, 2, 3, 4, 5]

mylist = [i**2 for i in a]
print(mylist) # [1, 4, 9, 16, 25]

mylist = [i for i in a if i%2==0]
print(mylist) # [2, 4]

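Reasons to use a tuple rather than a list
- a tuple uses less space
- a tuple's items cannot be reassigned once it is created, so there is no risk of the values being corrupted
- a tuple can be used as a dictionary key (a list cannot, because lists are mutable and therefore unhashable)
- a namedtuple can serve as a simple alternative to an object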
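
A minimal sketch of the dictionary-key point. The grid dictionary is a made-up example: tuple keys work, while a list key raises TypeError because lists are unhashable.

```python
grid = {(0, 0): "origin", (1, 2): "point"}
value = grid[(1, 2)]           # tuples are hashable, so lookup by tuple works

try:
    bad = {[0, 0]: "origin"}   # a list cannot be hashed
    list_key_ok = True
except TypeError:
    list_key_ok = False

print(value, list_key_ok)
```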

How to create a dictionary

d = {}
d = dict()
d = {"a" : 1, 2 : "b"}
d = dict( [ ['a',1], ['b',2], ['c',3] ] )
d = dict( ( ['a',1], ['b',2], ('c',3) ) )

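How to look up and extract values from a dictionary

d['a'] # raises KeyError if the key 'a' is missing
d.get('a') # returns None if the key 'a' is missing
d.get('a', 'nothing here') # returns 'nothing here' if the key 'a' is missing

'a' in d # True if 'a' is a key of d
if (key := 'a') in d: # the parentheses bind key to 'a' rather than to the result of 'a' in d
    print("key : ", key)
    print("value : ", d['a'])

d.keys() # all keys
d.values() # all values
d.items() # all key-value pairs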

Merging dictionaries

d1 = {1:1}
d2 = {1:1, 2:2}
d3 = {2:2, 3:3}
d = {**d1, **d2} #{1: 1, 2: 2}
d = {**d1, **d2, **d3} # {1: 1, 2: 2, 3: 3}

d1 = {1:1}
d2 = {1:1, 2:2}
d1.update(d2) # this method returns nothing; it adds the items to d1 itself
print(d1) # {1: 1, 2: 2}

d1 = {1:1}
d2 = {1:'a'}
d1.update(d2)
print(d1) # {1:'a'}. items with the same key are overwritten with the value from the update argument (d2)

Deleting from a dictionary

d = {1:1, 2:2}
del d[1]
print(d) # {2:2}
del d[-99] # raises KeyError

d = {1:1, 2:2}
a = d.pop(1)
print(a) # 1
print(d) # {2:2}
a = d.pop(-99) # raises KeyError
a = d.pop(-99, 'Nothing here') # returns the default when the key is missing
print(a) # Nothing here
d.pop() # raises TypeError, since pop needs at least a key

As with lists, dictionaries have both shallow and deep copies

Shallow copy
a = {1:1}
b = a.copy()

Deep copy
import copy
a = {1:1}
b = copy.deepcopy(a)

Dictionaries can be compared with == or !=

a = {1:1, 2:2}
b = {2:2, 1:1}

a==b # True
a!=b # False
not a==b # False

a = {1:[1,2]}
b = {1:[2,3]}
a==b # False

As with lists, a dictionary comprehension expresses a for loop in a single line

>>> d = {k : k for k in (1,2,3)}
>>> d
{1: 1, 2: 2, 3: 3}

>>> word = "eyeballs"
>>> letter_counter = {key: word.count(key) for key in word}
>>> letter_counter
{'e': 2, 'y': 1, 'b': 1, 'a': 1, 'l': 2, 's': 1}

>>> word = "eyeballs"
>>> letter_counter = {key: word.count(key) for key in word if key in ('b','l','s')}
>>> letter_counter
{'b': 1, 'l': 2, 's': 1}

A default value can be returned when accessing a key that does not exist

>>> d = {'a':1, 'b':2}
>>> print(d.get('c')) # get returns None for a missing key
None

>>> d = {'a':1, 'b':2}
>>> print(d.setdefault('a', 3)) # 'a' exists, so 1 is returned. setdefault always returns the value stored under the key, whether it was already there or has just been inserted
1
>>> print(d.setdefault('b', 3)) # 'b' exists, so 2 is returned
2
>>> print(d.setdefault('c', 3)) # 'c' does not exist, so the second argument 3 is stored and returned
3
>>> print(d)
{'a': 1, 'b': 2, 'c': 3} # 'c':3 was added as well

defaultdict also returns a default value when a missing key is accessed
It takes a function (a factory) that produces the default

>>> from collections import defaultdict
>>> dd_int = defaultdict(int) # int is the factory; calling int() gives 0, so the default is 0
>>> print(dd_int)
defaultdict(<class 'int'>, {}) # empty at first

>>> dd_int['a'] = 1 # store key 'a' with value 1
>>> print(dd_int['a']) # key 'a' exists, so 1 is returned
1
>>> print(dd_int['b']) # key 'b' does not exist, so the int default 0 is returned
0
>>> print(dd_int)
defaultdict(<class 'int'>, {'a': 1, 'b': 0}) # the missing key 'b' was also inserted by the lookup

>>> print(defaultdict(str)['key']) # the str default is ""

>>> print(defaultdict(dict)['key']) # the dict default is {}
{}
>>> print(defaultdict(list)['key']) # the list default is []
[]

>>> def default_func(): return "default_value" # a function returning the default can be passed in
>>> print(defaultdict(default_func)['key'])
default_value

>>> print(defaultdict(lambda: 'default_value')['key']) # a lambda is a compact way to supply such a function
default_value
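
Two common uses of defaultdict, sketched with made-up sample data: grouping items into lists and counting occurrences. In both cases the missing-key default removes the need for an explicit existence check.

```python
from collections import defaultdict

words = ["apple", "avocado", "banana", "blueberry", "cherry"]

# Group words by first letter; a missing key starts out as an empty list.
by_letter = defaultdict(list)
for w in words:
    by_letter[w[0]].append(w)

# Count letters; a missing key starts out as 0.
counts = defaultdict(int)
for ch in "eyeballs":
    counts[ch] += 1

print(dict(by_letter))
print(dict(counts))
```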

Creating a set

s = set()
s = {1,2,3,3} # {1,2,3}. duplicates are dropped
s = set('aabbcc') # {'a', 'b', 'c'}
s = set( [1,1,2,2,3,3] ) # {1,2,3}
s = set( {1:'a', 2:'b', 2:'c'} ) # {1,2}. only the keys are used

s = {1,2,3}
s.add(4) # {1,2,3,4}
s.remove(1) # {2,3,4}

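a = {1,2,3}
b = {2,3,4}

>>> a & b # {2, 3}. intersection
>>> a - b # {1}. difference
>>> b - a # {4}. difference
>>> a | b # {1, 2, 3, 4}. union
>>> a.symmetric_difference(b) # {1, 4}. symmetric difference (items in exactly one of the sets), same as a ^ b

Subsets are shown below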
a = {2,3}
b = {1,2,3,4}
>>> a.issubset(b) # True
>>> a <= b # True
>>> a < b # True
>>> b.issubset(a) # False
>>> b <= a # False
>>> b < a # False

a = {1, 2, 3}
b = {2}
b < a # True. proper subset: every element of b is in a, and b is not equal to a
b <= a # True. subset: every element of b is in a

a = {1, 2, 3}
b = {1, 2, 3}
b < a # False. b equals a, so it is not a proper subset
b <= a # True. subset: every element of b is in a

Set comprehension

>>> s = {a for a in (1,1,2,2)}
>>> s
{1, 2}
>>> s = {a%3 for a in (1,2,3,4,5,6)}
>>> s
{0, 1, 2}
>>> s = {a%3 for a in (1,2,3,4,5,6) if a % 3 != 0}
>>> s
{1, 2}

To make a set immutable (no adding, removing, or updating), use frozenset

>>> s = frozenset([1,1,2,2])
>>> s.add(3)
Traceback (most recent call last):
File "<pyshell#136>", line 1, in <module>
s.add(3)
AttributeError: 'frozenset' object has no attribute 'add'

To tell None apart from False, use is

a = None
if a is None : print("None")
else : print("False")

Arguments can be passed by name when calling a function

def func(a, b) :
    print(a, b)

func(b = "BB", a = "AA") # AA BB

Default values for function arguments

def func(a, b="BB"):
    print(a, b)

func("AA") # AA BB
func(a = "AA") # AA BB
func("AA", "XX") # AA XX
func(a = "AA", b = "XX") # AA XX

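A default argument value is evaluated when the function is defined, not each time it is called
In other words, the default object created at definition time is kept and reused across calls
The example below illustrates this

>>> def func(a, l = []):
    l.append(a)
    print(l)
>>> func(1)
[1]
>>> func(2)
[1, 2] # one might expect [2], but the result is [1, 2] with the earlier 1 included, because the same default list is reused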
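
A common way around the shared default is to use None as a sentinel and create the list inside the function. A minimal sketch follows; the names append_to and bucket are made up for this illustration.

```python
def append_to(item, bucket=None):
    # A fresh list is created on every call that omits bucket.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = append_to(1)
second = append_to(2)
print(first, second)
```

Each call that omits bucket now starts from an empty list, so results no longer leak between calls.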

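When a mutable argument such as a list is passed to a function, the parameter receives a reference to the same list,
so mutations made inside the function apply to the caller's list

l = [1,2,3]
def func(l) :
    l.append(99)
func(l)
print(l) # [1,2,3,99]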

When the number of arguments cannot be known in advance, put an asterisk ( * ) on the parameter

>>> def func(*args):
    print(args)
>>> func(1)
(1,)
>>> func(1,2)
(1, 2)
>>> func(1,2,3,4,5)
(1, 2, 3, 4, 5)

As below, putting an asterisk on a tuple in the call
unpacks it, so the function receives the individual values rather than one tuple

>>> def func(*args):
    print(args)
>>> a = (1,2,3)
>>> func(a)
((1, 2, 3),) # received as a single tuple
>>> func(*a)
(1, 2, 3) # received as 1, 2, 3

The variadic parameter (*args) can also come first
In that case the parameters after it are keyword-only...

>>> def func(*args, c, d):
    print(args, c, d)

>>> func(1, c="CC", d="DD")
(1,) CC DD

>>> func(1,2,3,4,5, c="CC", d="DD")
(1, 2, 3, 4, 5) CC DD

>>> func(c="CC", d="DD", 1,2,3) # positional values must come before keyword arguments
SyntaxError: positional argument follows keyword argument

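With two asterisks, the keyword arguments passed in are collected into a dictionary inside the function

>>> def func(**kwargs):
    print(kwargs)
>>> func()
{}
>>> func(a="AA", b="BB")
{'a': 'AA', 'b': 'BB'}
>>> func(c="CC")
{'c': 'CC'} # each call gets a fresh dictionary; nothing carries over from the previous call

Passing a dictionary positionally raises an error, however. It has to be unpacked with ** and its keys have to be strings
>>> func({1:1, 2:2})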
Traceback (most recent call last):
File "<pyshell#174>", line 1, in <module>
func({1:1, 2:2})
TypeError: func() takes 0 positional arguments but 1 was given
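
A small sketch of passing a dictionary to a **kwargs function by unpacking it with ** at the call site. The options dictionary is made up for the illustration, and its keys are strings because keyword names must be strings.

```python
def func(**kwargs):
    return kwargs


options = {"a": "AA", "b": "BB"}
received = func(**options)      # equivalent to func(a="AA", b="BB")
print(received, received is options)
```

The function receives a new dictionary with the same items, not the caller's dictionary object itself.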

A bare asterisk goes between a function's ordinary parameters and its keyword-only parameters and separates the two
For example, def func(a, b, *, c="CC", d="DD") divides the parameters into ordinary ones (a, b) and keyword-only ones (c, d)

a and b, placed before the asterisk, can be passed by position (passing them by keyword is also allowed),
while c and d, placed after the asterisk, must be passed by keyword

>>> def func(a, b, *, c="CC", d="DD") :
    print(a,b,c,d)

>>> func(1,2)
1 2 CC DD

>>> func(1) # error: the required argument b was not supplied
Traceback (most recent call last):
  File "<pyshell#206>", line 1, in <module>
    func(1)
TypeError: func() missing 1 required positional argument: 'b'

>>> func(1,2)
1 2 CC DD

>>> func(1,2, c="XX")
1 2 XX DD

>>> func(c="XX", 1, 2) # error: a positional argument cannot follow a keyword argument
SyntaxError: positional argument follows keyword argument

>>> func(1,2,3,4) # error: c and d must be passed by keyword, so the extra positional values are rejected
Traceback (most recent call last):
  File "<pyshell#210>", line 1, in <module>
    func(1,2,3,4)
TypeError: func() takes 2 positional arguments but 4 were given
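
For completeness, a sketch of the / marker (Python 3.8 and later), which is what actually forces parameters to be positional-only. A bare * only makes the parameters after it keyword-only. The function below is a made-up example.

```python
def func(a, b, /, *, c="CC"):
    # a and b are positional-only, c is keyword-only.
    return (a, b, c)


ok = func(1, 2, c="XX")

try:
    func(a=1, b=2)             # positional-only parameters reject keywords
    keyword_allowed = True
except TypeError:
    keyword_allowed = False

print(ok, keyword_allowed)
```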

A bare asterisk can force every argument to be passed by keyword

>>> def func(*, a, b):
    print(a,b)

>>> func(a="AA", b="BB") # a and b are passed by keyword
AA BB

>>> func(1) # a value that is not passed by keyword is rejected
Traceback (most recent call last):
  File "", line 1, in
    func(1)
TypeError: func() takes 0 positional arguments but 1 was given

>>> func(1, a="AA", b="BB") # a value that is not passed by keyword is rejected
Traceback (most recent call last):
  File "<pyshell#224>", line 1, in <module>
    func(1, a="AA", b="BB")
TypeError: func() takes 0 positional arguments but 1 positional argument (and 2 keyword-only arguments) were given

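A string placed at the start of the function body serves as brief documentation for the function
It can be viewed with help or by reading the function's .__doc__ attribute
This is called a docstring

For example
>>> def func():
    "this func print your name"
    print("eyeballs")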

>>> help(func)
Help on function func in module __main__:
func()
    this func print your name

>>> func.__doc__
'this func print your name'

This way you can get help when using a function whose usage (which arguments it takes and how) you do not know

>>> import copy
>>> help(copy.deepcopy)
Help on function deepcopy in module copy:

deepcopy(x, memo=None, _nil=[])
    Deep copy operation on arbitrary Python objects.

    See the module's __doc__ string for more info.

From this short documentation I can tell:
"the arguments are x, memo, and _nil, and memo defaults to None"

In addition, to see the attributes and methods available on an object, use dir

>>> dir(copy.deepcopy)
['__annotations__', '__call__', '__class__', '__closure__', '__code__', '__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__get__', '__getattribute__', '__globals__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__kwdefaults__', '__le__', '__lt__', '__module__', '__name__', '__ne__', '__new__', '__qualname__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']

>>> dir("eyeballs")
['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

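A function can be defined and used inside another function
Such an inner function cannot be used from outside the enclosing function,
so it is handy as a temporary helper

>>> def func():
    def inner_func():
        print("hi eyeballs")
    inner_func()

>>> func()
hi eyeballs
>>> inner_func() # the inner function cannot be called here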
Traceback (most recent call last):
  File "<pyshell#292>", line 1, in <module>
    inner_func()
NameError: name 'inner_func' is not defined

내부 함수를 이용하여 '클로저'를 만들 수 있음
chatGPT 에 의하면, 클로저란 다음과 같음

"클로저(Closure)는 함수 내부에서 또 다른 함수를 정의하고 반환할 때 만들어지는 함수 객체입니다. 반환된 내부 함수는 자신이 선언된 환경(외부 함수의 변수 등)을 기억하여, 외부 함수의 실행 컨텍스트가 종료된 후에도 이 정보를 사용할 수 있습니다."

즉, 생성될 때 넣어준 정보를 계속 지니고 있는 함수가 클로저임

>>> def func(mention):
    def inner_func():
        return f"what you said was this : {mention}"
    return inner_func

>>> a = func("hi eyeballs!")
>>> b = func("bye eyeballs!")

>>> a() # a 는 클로저(함수)이기 때문에 실행이 가능하며, 실행시 클로저를 정의할 때 넣어줬던 정보를 기억함
'what you said was this : hi eyeballs!'
>>> b() # b 도 클로저(함수)이기 때문에 실행이 가능하며, 실행시 클로저를 정의할 때 넣어줬던 정보를 기억함
'what you said was this : bye eyeballs!'

클로저를 사용하면, 함수 객체 내부의 상태를 계속 유지할 수 있음.
내부 함수에서 내부 상태를 업데이트하는 기능을 넣어주면, 상태를 계속 업데이트 할 수 있음

>>> def func():
    c = 0
    def add(a=1):
        nonlocal c
        c += a
        return c
    return add

>>> closure = func()
>>> closure()
1
>>> closure()
2
>>> closure(98)
100
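
A minimal sketch of where that state actually lives, assuming CPython: captured variables are stored in "cells" reachable through the function's `__closure__` attribute. The names `make_counter` and `counter` are made up for this illustration.

```python
def make_counter():
    count = 0  # captured by the inner function

    def add(step=1):
        nonlocal count  # rebind the enclosing variable instead of creating a local
        count += step
        return count

    return add


counter = make_counter()
counter()     # count becomes 1
counter(9)    # count becomes 10

# Each captured variable is kept in a cell attached to the function object.
free_names = counter.__code__.co_freevars
captured = counter.__closure__[0].cell_contents
print(free_names, captured)
```

So the state is not truly hidden; it is just not reachable by name from outside.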

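An important caveat: an inner function can read a variable defined in the outer function, but it cannot reassign it as-is (any assignment makes the name local to the inner function)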

>>> def outer():
    c = 0
    def inner():
        c += 1 # the inner function tries to reassign c, which was defined in the outer function
        print(c)
    inner()

>>> outer()
Traceback (most recent call last):
  File "<pyshell#375>", line 1, in <module>
    outer()
  File "<pyshell#374>", line 6, in outer
    inner()
  File "<pyshell#374>", line 4, in inner
    c+=1
UnboundLocalError: local variable 'c' referenced before assignment

To modify a variable defined in the outer function from the inner function, use nonlocal

>>> def outer():
    c = 0
    def inner():
        nonlocal c
        c += 1
        print(c)
    inner()

>>> outer()
1

Note, however, that nonlocal does not let you modify just any variable outside the function

>>> c = 0
>>> def func():
    nonlocal c # referring to the outermost (module-level) c is a syntax error

SyntaxError: no binding for nonlocal 'c' found

이렇게 가장 바깥쪽 c 변수가 위치한 곳을 "global scope" 라고 함
global scope 에 있는 변수는 nonlocal 을 이용하여 접근할 수 없음
nonlocal 은, 단지 nearest enclosing scope (outer scope) 에 정의된 변수에만 접근 및 수정 할 수 있게 도와줌

global scope 에 있는 변수에 접근 및 수정하려면 global 을 사용해야 함

>>> c = 0
>>> def func():
    global c
    c += 1
    print(c)

>>> func()
1
>>> print(c)
1
>>> func()
2
>>> print(c)
2

In addition, the variables in the global scope (namespace) can be inspected by calling globals()
local scope(namespace) 에 있는 변수들을 locals() 을 실행하여 확인 가능
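
A small sketch of both calls, under the assumption that it runs as a normal module or REPL session. `show_scopes` and `counter` are illustrative names only.

```python
counter = 0  # module-level (global) name


def show_scopes(x):
    y = x * 2
    local_names = sorted(locals())        # names bound inside this call so far
    sees_global = "counter" in globals()  # the module namespace, as a dict
    return local_names, sees_global


local_names, sees_global = show_scopes(3)
print(local_names, sees_global)
```

Both functions return dictionaries keyed by variable name, so they are mostly useful for debugging rather than for regular program logic.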

클로저를 사용하면, 내부 데이터를 은닉(캡슐화)할 수 있음
아래 예제에서 625 라는 숫자는, 클로저를 호출하는 바깥에서는 볼 수 없는 미지의 은닉된 숫자임

>>> def func():
    def inner_func(a):
        if a == 625 : print("correct!")
        else : print("wrong")
    return inner_func

>>> closure = func()
>>> closure(1)
wrong
>>> closure(50)
wrong
>>> closure(100)
wrong
>>> closure(625)
correct!

클로저는 정보를 은닉하여 계속 유지하기 때문에, 메모리에 계속 남아있게 됨
따라서 사용하지 않는 클로저는 삭제하는 것이 좋음

del closure

람다 lambda 함수는 단일 문장으로 표현되는 익명 함수임
따로 def 를 이용하여 정의내리지 않고, 그 때 그 때 필요한 때 사용하고 버림

일반적인 함수

def func(a, b):
    print(a, b)

동일한 역할을 하는 람다 함수

lambda a, b : print(a, b)

아래처럼 간단하게 사용 가능

a = lambda a, b : print(a,b)
a(1,2) # 1 2

lambda 함수는, 콜론 ( : ) 뒤의 명령어가 실행되거나 반환됨

>>> a = lambda a, b : a+b
>>> a(1,2)
3

>>> x, y = 1, 2
>>> swap = lambda a, b: (b, a)
>>> x, y = swap(x, y)
>>> print(x, y)
2 1

아래와 같은 함수를 인자로 받는 함수에서

>>> def func(mylist, myfunc):
    for i in mylist:
        print(myfunc(i))

def 로 정의된 함수를 넣어도 되지만

>>> def myfunc(i):
    return i.capitalize()
>>> func(["hi", "eyeballs"], myfunc)
Hi
Eyeballs

lambda 함수를 바로 넣을 수 있음

>>> func(["hi", "eyeballs"],lambda i: i.capitalize())
Hi
Eyeballs

아무것도 실행하지 않는 함수를 만들기 위해 pass 를 사용

def doNothing():
    pass

python3 에서 모든 객체는 기본적으로 "강한 참조"로 생성됨

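A strong reference increases the reference count; an object is not collected while at least one strong reference to it exists
A weak reference does not increase the reference count, so it does not keep the object alive
Weak references have to be created explicitly (with the weakref module)
A strong reference points directly at the object in memory, while a weak reference reaches it indirectly through a wrapper object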

강한 참조와 약한 참조 예제

>>> import weakref
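>>> def func(): pass # the name func holds a strong reference to the function object
>>> weak_ref = weakref.ref(func) # create a weak reference to the same object
>>> type(weak_ref()) # calling the weak reference returns the function while it is alive
<class 'function'>
>>> del func # once the last strong reference is gone, the object is freed and the weak reference goes dead
>>> type(weak_ref()) # a dead weak reference returns None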
<class 'NoneType'>

- 강한 참조 (Strong Reference)
- 참조 카운트 : 증가
- 객체 생존 : 강한 참조가 존재하면 삭제되지 않음
- 활용례 : 일반적인 객체

- 약한 참조 (Weak Reference)
- 참조 카운트 : 증가하지 않음
- 객체 생존 : 강한 참조가 없으면 Garbage Collection 가능 (수거되어 사라짐)
- 활용례 : cache 를 만들거나, 메모리 관리가 중요한 앱을 만들 때 사용됨

이거 마치 리눅스의 하드 링크와 소프트(심볼릭) 링크의 관계 같다는 느낌이 듦...
강한 참조를 갖는 객체를 바라보는 약한 참조 객체는 soft link 같아서
강한 참조 객체가 사라지면(원본 file/dir 가 사라지면) 약한 참조 객체는 참조 할 객체가 사라지게 됨(soft link 가 갈 길을 잃음)
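
The cache use case mentioned above could look roughly like this. It is a sketch using `weakref.WeakValueDictionary`; the `Image` class and the key names are hypothetical, and the immediate cleanup relies on CPython's reference counting (the `gc.collect()` call is only there for other implementations).

```python
import gc
import weakref


class Image:
    """Stand-in for an expensive object (hypothetical)."""

    def __init__(self, name):
        self.name = name


cache = weakref.WeakValueDictionary()

img = Image("logo.png")   # strong reference
cache["logo"] = img       # the cache itself holds only a weak reference

alive_before = "logo" in cache
del img                   # drop the last strong reference
gc.collect()              # not needed on CPython, harmless elsewhere
alive_after = "logo" in cache
print(alive_before, alive_after)
```

The entry disappears from the cache by itself, so the cache never keeps an otherwise unused object in memory.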

nested list 가 중첩된 리스트를 flatten 하게 만들기 위해 generator 를 아래와 같이 사용 가능

>>> lol = [1,[2,[3,4],5,[6],7]]
>>> def flatten(l):
    for item in l:
        if isinstance(item, list):
            for subitem in flatten(item):
                yield subitem
        else:
            yield item

>>> list( flatten(lol) )
[1, 2, 3, 4, 5, 6, 7]
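
The inner for loop can also be written with `yield from`, which delegates to the nested generator. A sketch of the same function in that style:

```python
def flatten(items):
    for item in items:
        if isinstance(item, list):
            yield from flatten(item)  # hand over to the generator for the nested list
        else:
            yield item


nested = [1, [2, [3, 4], 5, [6], 7]]
flat = list(flatten(nested))
print(flat)
```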

try-except 예제

try:
    1/0
except:
    print("divided by zero")

try:
    [1,2,3][4]
except IndexError as ie:
    print("index error. message : ", ie) # ie carries the error message produced by the interpreter
except Exception as e:
    print("exception. message : ", e)

def divide(by):
    try:
        result = 1/by
    except ZeroDivisionError as e:
        print("divided by zero", e)
    else:
        print(result) # else runs only when no exception was raised
    finally:
        print("done")

try:
    raise Exception("exception by developer on purpose") # raise an Exception deliberately; an error message can be supplied
except Exception as e:
    print("exception message : ", e)
    raise # a bare raise re-raises the same exception

exception message :  exception by developer on purpose
Traceback (most recent call last):
  File "<pyshell#619>", line 2, in <module>
    raise Exception("exception by developer on purpose")
Exception: exception by developer on purpose

try:
    raise RuntimeError("exception by developer on purpose") # any specific error type can be raised
except RuntimeError as re:
    print("exception message :", re)

exception message : exception by developer on purpose
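
To see which of the try / except / else / finally branches run in each case, here is a sketch that records the branches instead of printing. It is a variation of the divide example above, with the `steps` list added for illustration.

```python
def divide(by):
    steps = []
    try:
        result = 1 / by
    except ZeroDivisionError:
        steps.append("except")   # only when the division fails
        result = None
    else:
        steps.append("else")     # only when no exception was raised
    finally:
        steps.append("finally")  # always
    return result, steps


ok = divide(2)
failed = divide(0)
print(ok)
print(failed)
```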

assert can be used to raise an exception
assert is a statement for checking conditions that must never be violated
It raises AssertionError when the given condition is False

>>> def func(i):
    assert i % 3 == 0, '3의 배수가 아님'
    print(i,'는 3의 배수임')

>>> func(3)
3 는 3의 배수임
>>> func(1)
Traceback (most recent call last):
  File "<pyshell#638>", line 1, in <module>
    func(1)
  File "<pyshell#636>", line 2, in func
    assert i % 3 == 0, '3의 배수가 아님'
AssertionError: 3의 배수가 아님

예외를 직접 만들 수 있음

>>> class NotThreeMultipleError(Exception):
    def __init__(self):
        super().__init__('3의 배수가 아닙니다.')

>>> try:
    raise NotThreeMultipleError
except Exception as e:
    print("error message", e)

error message 3의 배수가 아닙니다.

예외 메세지를 raise 에 붙일 수 있음

>>> class NotThreeMultipleError(Exception):
    pass

>>> try:
    raise NotThreeMultipleError("3의 배수가 아님")
except NotThreeMultipleError as e:
    print("message :", e)

message : 3의 배수가 아님

객체(Object) : 파이썬의 모든 데이터. 인스턴스를 포함. 객체는 데이터(변수, 속성)와 코드(메소드)를 포함하는 자료구조
인스턴스(Instance) : 특정 클래스에서 생성된 객체 (특정 클래스의 '사례')

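In Python every object is an instance of some class (even a class is itself an instance of type)
"Instance" simply stresses the relationship to a particular class, e.g. my_class below is an instance of MyClass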

클래스에 dictionary 마냥 속성 추가 가능

class MyClass(): pass

my_class = MyClass()
my_class.name = "eyeballs"
print(my_class.name) # eyeballs

class 생성시 속성 초기화 실행

class MyClass():
    def __init__(self, name, age):
        self.name = name
        self.age = age
    def printing(self):
        print("name : ", self.name, "age : ", self.age)

my_class = MyClass("eyeballs", 625)
my_class.printing()

여기서 self 는 클래스의 인스턴스(instance) 자신을 참조하는 변수임
__init__ 메소드를 포함한 모든 인스턴스 메서드는 호출될 때 첫 번째 인수로 자동으로 인스턴스 자신이 전달됨
따라서, self를 사용하여 클래스 내부에서 해당 인스턴스의 속성과 메서드에 접근할 수 있음

위에 MyClass 인스턴스에서 printing() 이 실행되면,
printing() 의 첫번째 인수(self) 자리에 my_class 인스턴스가 전달됨
그래서 인스턴스(my_class) 의 name 과 age 를 사용할 수 있게 됨

참고로 __init__ 은 생성자가 아니라 단지 초기화 메소드임
왜냐면, __init__ 호출 전에 이미 객체가 만들어지기 때문
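
The object is actually created by `__new__`, and `__init__` then receives it. A sketch that records the call order; the `Tracked` class and its `calls` list exist only for this illustration.

```python
class Tracked:
    calls = []

    def __new__(cls, *args, **kwargs):
        cls.calls.append("__new__")    # the object is created here
        return super().__new__(cls)

    def __init__(self, name):
        self.calls.append("__init__")  # the already created object is initialised here
        self.name = name


obj = Tracked("eyeballs")
print(Tracked.calls)
```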

class Parent(): pass
class Child(Parent): pass

issubclass(Child, Parent) # True

super() 를 사용하여 부모의 메소드를 이용하면, 부모 클래스 레벨에서 작업이 이루어짐

>>> class Parent():
    def __init__(self, name):
        self.parent_name = name

>>> class Child(Parent):
    def __init__(self, name, age):
        super().__init__(name)
        self.child_age = age

>>> child = Child("eyeballs", 625)
>>> dir(child)
['__class__', '__delattr__', '__dict__', ....... 'child_age', 'parent_name']

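When a method or attribute is not found on the class itself,
Python searches the parent classes along the method resolution order (MRO)

With multiple inheritance, the bases are searched left to right in the order they are listed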

>>> class High():
    def name(self):
        return "High"

>>> class Middle1(High):
    def name(self):
        return "Middle1"

>>> class Middle2(High):
    def name(self):
        return "Middle2"

>>> class Low1(Middle1, Middle2): pass
>>> Low1().name()
'Middle1' # Middle1 이 먼저 상속되었기 때문에 Middle1의 name() 이 호출됨

>>> class Low2(Middle2, Middle1): pass
>>> Low2().name()
'Middle2' # Middle2 가 먼저 상속되었기 때문에 Middle2의 name() 이 호출됨
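
The exact search order can be inspected through the `__mro__` attribute. A sketch that rebuilds the same class hierarchy as above and lists the order:

```python
class High:
    def name(self):
        return "High"


class Middle1(High):
    def name(self):
        return "Middle1"


class Middle2(High):
    def name(self):
        return "Middle2"


class Low1(Middle1, Middle2):
    pass


# The shared ancestor High comes after both Middle classes.
mro_names = [cls.__name__ for cls in Low1.__mro__]
print(mro_names)
```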

python 에는 class 의 속성에 접근하지 못하게 막는 private 접근지시어 등은 없음
대신, 속성 이름을 다른 이름으로 가려서 접근을 우회하여 막는 방법이 있음

class MyClass():
    def __init__(self, name):
        self.private_name = name
    def getter(self):
        return self.private_name
    def setter(self, name):
        self.private_name = name
    public_name = property(getter, setter)

my_class =MyClass("eyeballs")
print(my_class.public_name) # MyClass 를 사용하는 시점에서 private_name 대신 public_name 을 사용함
my_class.public_name = "w"
print(my_class.public_name)

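Passing getter and setter to property and exposing the result as public_name
steers users of the class away from touching private_name directly
(of course, a developer who knows the name "private_name" (via dir, for example) can still access it.....)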

class 내의 @property 는 '계산된 값'에 접근하도록 돕기도 함

class MyClass():
    def __init__(self, name):
        self.name = name
    @property
    def name_length(self):
        return len(self.name)

my_class =MyClass("eyeballs")
print(my_class.name_length) # name_length 는 메소드지만, 마치 변수에 접근하는 것 마냥 접근함
8

참고로 property 로 설정된 속성은 read-only 가 됨. 아래처럼 수정하려면, setter 를 설정해야 함

>>> my_class.name_length=1
Traceback (most recent call last):
  File "py.py", line 11, in <module>
    my_class.name_length = 1
AttributeError: can't set attribute
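
A setter can be attached with the `@<property name>.setter` decorator. A minimal sketch; the `Person` class and the type check inside the setter are assumptions made for illustration.

```python
class Person:
    def __init__(self, name):
        self._name = name

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        # the setter is a convenient place for validation
        if not isinstance(value, str):
            raise TypeError("name must be a str")
        self._name = value


p = Person("eyeballs")
p.name = "w"   # goes through the setter
print(p.name)
```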

property 를 통해 속성 이름을 다른 것으로 바꾸는 방법 외에
dunder (double underbar) 를 통해 이름을 다르게 바꿀 수 있음

>>> class MyClass():
    def __init__(self, name):
        self.__name = name # with a leading double underscore, the attribute cannot be reached as __name from outside the class

>>> my_class = MyClass("eyeballs")
>>> dir( my_class )
['_MyClass__name', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__']

__name 대신 _MyClass__name 이 새로 생김
__name 에 접근하고 싶으면 getter, setter를 사용하던가, 아니면 (숨겨진 이름의) _MyClass__name 을 사용

이렇게 변수나 메소드의 이름을 컴파일 단계에서 일정한 규칙을 통해 바꾸는 것을 맹글링이라고 함

class method 는 인스턴스가 아닌 클래스 자기 자신 전체에 영향을 미치는 메소드임
(인스턴스가 아니라) 클래스 속성을 변경하거나 새로운 객체를 생성하는 용도로 사용됨

@classmethod 데코레이터를 추가하여 class method 생성 가능

class MyClass():
    count = 0
    def __init__(self):
        MyClass.count += 1
    @classmethod
    def get_count(cls): # self 대신 cls 를 사용. 즉, class 자기 자신을 가리키며 참조함
        return cls.count

a = MyClass()
b = MyClass()
print(MyClass.get_count()) #2. 클래스를 통해 호출
print(b.get_count()) #2. 인스턴스를 통해 호출

self 는 클래스의 인스턴스 자신을 가리키는 변수이며
cls 는 클래스 자기 자신을 가리키는 변수임

class method 를 통해 인스턴스를 찍어내는 팩토리 패턴 구현 가능

class MyClass():
    def __init__(self, name, age):
        self.name = name
        self.age = age

    @classmethod
    def get_instance(cls, name, birth_year):
        return cls(name, 2025-birth_year) # cls 를 이용하여 인스턴스를 생성한 후 반환

a = MyClass.get_instance("eyeballs", 1991)
b = MyClass.get_instance("w", 1965)

print(a.name, a.age)
print(b.name, b.age)

static method 는 클래스에서 곧바로 호출 가능한 메소드. 편의(유틸리티)를 위해 존재함
@staticmethod 데커레이터를 사용하여 정의

>>> class MyClass():
    @staticmethod
    def read_me():
        print("please read this first")

>>> MyClass.read_me()
please read this first

타입을 미리 정하는게 아니라 실행이 되었을 때 해당 Method들을 확인하여 타입을 정하는 것을 '덕타이핑'이라고 함

def func(obj):
    print(obj.name())

class A:
    def name(self):
        return self.__class__.__name__

class B:
    def name(self):
        return self.__class__.__name__

func(A()) # 'A' 출력.
func(B()) # 'B' 출력.

func 메소드의 obj 에 뭐가 들어오던간에
name() 이라는 메소드가 있다면 실행하게 됨. 그것도 런타임에.

dataclass 를 통해, 클래스에 데이터(속성)를 직관적으로 지정하여 설정할 수 있음
아래 두 가지 class 는 동일하게 name 속성을 갖는 class 임

class MyClass():
    def __init__(self, name):
        self.name = name

from dataclasses import dataclass
@dataclass
class MyDataClass():
    name : str

print(MyClass("eyeballs").name) # eyeballs
print(MyDataClass("eyeballs").name) # eyeballs

>>> from dataclasses import dataclass
>>> @dataclass
class MyDataClass():
    name: str
    age: int = 30

>>> MyDataClass("A", 20)
MyDataClass(name='A', age=20)

>>> MyDataClass(20, "A") # surprisingly, this is accepted: Python is dynamically typed and does not check or convert types at runtime, so the str in name: str is only a hint to the developer, not an enforced constraint
MyDataClass(name=20, age='A')

>>> MyDataClass(name="A", age=20)
MyDataClass(name='A', age=20)

>>> MyDataClass(age=20, name="A")
MyDataClass(name='A', age=20)

>>> MyDataClass("A") # age 생략, default 로 설정한 30이 대신 사용됨
MyDataClass(name='A', age=30)

타입을 강제하고 싶다면, __post_init__ 에서 체커를 추가함
__post_init__ 은 dataclass 의 __init__ 메서드가 호출된 후 자동으로 실행되는 메서드(__init__ 이 __post_init__ 을 호출함)

from dataclasses import dataclass

@dataclass
class MyDataClass:
    name: str

    def __post_init__(self):
        if not isinstance(self.name, str):
            raise TypeError(f"Expected str, got {type(self.name).__name__}")

MyDataClass(name="A")
MyDataClass(name=20)  # exception

A dataclass also generates magic methods (the dunder methods) automatically
예를 들어, __init__, __repr__, __eq__ 같은 메소드들을 자동으로 만들어 줌
위에서 __init__ 없이 데이터 클래스를 정의할 수 있었던 것도 다 이런 이유에서였음

>>> class A:
    def __init__(self, name):
        self.name = name

>>> @dataclass
class B:
    name : str

>>> print(A("eyeballs"))
<__main__.A object at 0x01562D00> # A defines no __repr__, so the default object.__repr__ prints the class name and memory address

>>> print(B("eyeballs")) # dataclass generates __repr__, so the output includes B's name field
B(name='eyeballs')
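
The generated `__eq__` makes a similar difference. A sketch comparing a plain class with a dataclass; the class names `Plain` and `Data` are illustrative.

```python
from dataclasses import dataclass


class Plain:
    def __init__(self, name):
        self.name = name


@dataclass
class Data:
    name: str


# Without __eq__, == falls back to identity, so two separate objects differ.
plain_equal = Plain("eyeballs") == Plain("eyeballs")
# The dataclass-generated __eq__ compares the fields.
data_equal = Data("eyeballs") == Data("eyeballs")
print(plain_equal, data_equal)
```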

mymodule.py 과 mycode.py 가 한 dir에 존재하는 경우 다음과 같이 바로 import 가능

< mymodule.py >
from random import choice
mylist = [1,2,3,4,5]
def pick():
    return choice(mylist)

< mycode.py >
import mymodule
print(mymodule.pick())

혹은

< mycode.py >
from mymodule import pick
print(pick())

혹은

< mycode.py >
from mymodule import pick as p
print(p())

"mypackage" 라는 dir 안에 mymodule.py 과 mymodule2.py 를 넣어둠
mycode.py 는 mypackage dir 와 동일한 위치에 존재함

< mypackage/mymodule.py >
from random import choice
mylist = [1,2,3,4,5]
def pick():
    return choice(mylist)

< mypackage/mymodule2.py >
from random import choice
mylist = [6,7,8,9,10]
def pick():
    return choice(mylist)

mypackage dir 안에 있는 mymodule.py, mymodule2.py 를 아래와 같이 from, import 로 나눠 불러올 수 있음

< mycode.py >
from mypackage import mymodule, mymodule2
print(mymodule.pick())
print(mymodule2.pick())

혹은

from mypackage.mymodule import pick
from mypackage import mymodule2
print(pick())
print(mymodule2.pick())

위와 같이 하위 dir 에서 module 을 불러올 때 from, import 를 사용함
그럼 from random 은 어디서 불러올까?
이 모듈 파일은, 현재 작업중인 dir 이내에 없기 때문에, python 이 다른 위치에서 해당 module 을 불러옴
그 "다른 위치"라는 곳은 아래와 같이 확인 가능

>>> import sys
>>> for p in sys.path:
    print("\"",p,"\"")

" "
"C:\Users\EYE\Python\"
"C:\Users\EYE\Python\Python38-32\Lib\idlelib"
"C:\Users\EYE\Python\Python38-32\python38.zip"
"C:\Users\EYE\Python\Python38-32\DLLs"
"C:\Users\EYE\Python\Python38-32\lib"
"C:\Users\EYE\Python\Python38-32"
"C:\Users\EYE\Python\Python38-32\lib\site-packages"

(현재 윈도우에서 작업중이라 위와 같은 경로들이 나타남)

가장 먼저 path 에 나타나는 것이 빈 문자열(" ") 임
이 말은, python 실행시 현재 dir 를 기준으로 먼저 module 을 찾는다는 의미임

임의의 path 를 추가하려면 아래와 같이 실행

import sys
sys.path.insert(0, r"C:\Users\EYE\my\python\module\path") # searched first; the raw string (r"...") keeps \U from being parsed as an escape sequence

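Module paths can also be given relative to the current module
Relative imports only work inside a package; a file run directly as a script has no parent package and raises ImportError
For example, to import mymodule.py located in the same package dir as mycode.py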

< mycode.py >
from . import mymodule
...

mycode.py 보다 상위 dir 위치에 있는 mymodule.py 을 불러오려면

< mycode.py >
from .. import mymodule
...

mycode.py 보다 상위 dir 위치의 mypackage 에 있는 mymodule.py 을 불러오려면

< mycode.py >
from ..mypackage import mymodule
...

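Surprisingly, a value inside an imported module can be updated directly
An imported module is a single object cached in sys.modules for the running interpreter process, not a copy made per import
So importing it again in the same process returns the same object with the update still in place,
while another program (a separate Python process) loads its own fresh module object,
so setting pi=3 here does not affect other programs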

>>> import math
>>> math.pi
3.141592653589793
>>> math.pi = 3
>>> math.pi
3
>>> import math
>>> math.pi
3
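
A sketch of that caching behaviour. To avoid touching a real module it registers a throwaway one by hand; `demo_settings` is a hypothetical module name used only here.

```python
import sys
import types

# Build a throwaway module object and register it in the import cache.
demo = types.ModuleType("demo_settings")
demo.pi = 3.14
sys.modules["demo_settings"] = demo

import demo_settings as first
first.pi = 3                     # mutate the shared module object

import demo_settings as second   # served from sys.modules, nothing is re-executed
same_object = first is second
print(same_object, second.pi)

del sys.modules["demo_settings"]  # clean up the cache entry
```

Both names point at the same module object, which is why the update is visible after the second import.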

Deque 는 스택과 큐의 기능을 모두 갖고 있음
즉, 양쪽으로 pop 이 가능함

>>> from collections import deque
>>> dq = deque([1,2,3,4,5])
>>> print(dq.pop()) # 가장 마지막에서 pop
5
>>> print(dq)
deque([1, 2, 3, 4])
>>> print(dq.popleft()) # 가장 처음에서 pop
1
>>> print(dq)
deque([2, 3, 4])
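
Pushing works on both ends as well. A sketch of a few more deque operations (`appendleft`, `rotate`, and the `maxlen` argument):

```python
from collections import deque

dq = deque([1, 2, 3])
dq.append(4)        # push on the right  -> deque([1, 2, 3, 4])
dq.appendleft(0)    # push on the left   -> deque([0, 1, 2, 3, 4])
dq.rotate(1)        # shift right by one -> deque([4, 0, 1, 2, 3])

recent = deque(maxlen=3)   # a bounded deque silently drops the oldest items
for n in range(5):
    recent.append(n)

print(dq, recent)
```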

여러 시퀀스들을 차례대로 순회하기 위해 itertools.chain 을 사용

>>> import itertools
>>> for item in itertools.chain([1], [2,3], (4,5,6)): # three different sequences are passed in
    print(item)

1
2
3
4
5
6

하나의 시퀀스를 순회하며 누적 계산을 하기 위해 itertools.accumulate 를 사용

>>> import itertools
>>> for item in itertools.accumulate([1,2,3,4]):
    print(item)

1
3
6
10

기본적으로 누적 합계를 계산함
For an accumulation other than a sum, pass a two-argument function as the second argument

>>> def mul(a,b):
    return a*b
>>> for item in itertools.accumulate([1,2,3,4], mul):
    print(item)

1
2
6
24
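
The function does not have to be written by hand. A sketch using `operator.mul`, the built-in `max`, and the `initial` keyword (available from Python 3.8):

```python
import itertools
import operator

products = list(itertools.accumulate([1, 2, 3, 4], operator.mul))
running_max = list(itertools.accumulate([3, 1, 4, 1, 5], max))
with_initial = list(itertools.accumulate([1, 2, 3], initial=100))

print(products)
print(running_max)
print(with_initial)
```

Note that `initial` adds one extra element at the front of the output.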

from pprint import pprint

pprint 는 일반 print 보다 훨씬 가독성 좋게 출력해줌
dictionary 를 출력하면 정렬도 해 줌....

python 은 pip 를 통해 PyPI(Python package index. https://pypi.org) 로부터 패키지를 다운받아 설치할 수 있음
"패키지를 설치"한다는 것의 의미는, 해당 패키지의 코드와 관련 종속성(dependencies)을 Python 환경에 다운로드하고
적절한 위치에 배치하여 사용할 수 있도록 등록하는 과정을 의미
여기서 말하는 '적절한 위치'란, site-packages(Python의 라이브러리 디렉터리)를 의미함.
이 site-packages 에 패키지 파일을 복사함
패키지 설치 이후에 python 스크립트 내에서 import 를 통해 패키지 및 모듈 사용이 가능하게 됨

패키지 파일 위치는 아래와 같이 확인 가능

import requests
print(requests.__file__) # 패키지의 실제 경로 출력

일반적으로 pip 로 설치한 패키지는 모든 python 프로젝트에서 사용 가능함 (global installation)
어느 특정 python 프로젝트 에서만 사용 가능하도록 패키지를 설치하려면
venv 또는 conda 같은 가상 환경을 사용해야 함

pip list # 설치된 패키지 목록 확인
pip show requests # 특정 패키지 정보 확인

pip 로 패키지 설치하기
pip install flask
pip install flask==0.9.0
pip install -r installations.txt # the txt file lists one package name per line; every package in it gets installed

pip install --upgrade requests # upgrade the named package to the latest version (a package name is required)
pip uninstall requests # 패키지 삭제

Python의 가상환경은 독립적인 실행 환경을 만들어서
각 프로젝트마다 별도의 패키지 및 Python 버전을 관리할 수 있도록 도와주는 기능

global package 와 다른 버전을 사용해야 할 때 사용되며,
venv, virtualenv, conda 등이 있음

< venv (Python 내장 가상환경) >
Python 3.3 이상에서 기본 제공되는 가상환경 도구

가상 환경 생성 명령어 : python -m venv my_env
my_env 라는 dir 가 생성되고, 이 dir 내에 가상환경 관련 파일이 저장됨
해당 프로젝트에서 설치한 패키지들이 이 my_env dir 에 설치됨

가상환경 활성 명령어 : source my_env/bin/activate

가상환경 비활성 명령어 : deactivate

가상환경 제거 명령어 : rm -rf my_env
(그냥 dir 를 지우는 거네)

< virtualenv (확장 기능이 있는 가상환경) >
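Offers more features than venv
Supports both Python 2.x and 3.x
Each virtual environment is tied to one interpreter, but environments can be created for any Python version installed on the machine.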

virtualenv 설치 명령어 : pip install virtualenv

가상환경 생성 명령어 : virtualenv my_env
특정 python 버전으로 가상환경 생성하는 명령어 : virtualenv -p /usr/bin/python3.8 my_env

가상환경 활성 명령어 : source my_env/bin/activate

가상환경 비활성 명령어 : deactivate

< conda (데이터 과학 및 패키지 관리 특화) >
데이터 과학, 머신러닝, 딥러닝 등에 최적화된 가상환경을 제공함
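Unlike pip, it can also install non-Python software (e.g. R, or the native libraries that packages such as numpy and tensorflow depend on).
Like virtualenv, each environment can use a different Python version.

To set up virtual environments with conda, install Anaconda (or the lighter Miniconda)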
설치 후 anaconda 가 제공하는 명령어를 통해 가상환경을 구축할 수 있음

특정 python 버전으로 가상환경 생성하는 명령어 : conda create --name my_env python=3.8

가상환경 활성 명령어 : conda activate my_env

가상환경 비활성 명령어 : conda deactivate

가상환경 제거 명령어 : conda remove --name my_env --all

주피터 노트북 설치 명령어 : pip install jupyter
주피터 노트북 실행 명령어 : jupyter notebook

주피터lab 설치 명령어 : pip install jupyterlab
주피터lab 실행 명령어 : jupyter lab

python unittest 사용 예제
첫 글자를 대문자로 바꾸는 함수를 테스트 할 예정

import unittest
def func(text):
    if text is None : return text
    try:
        if i := int(text) : return text
    except: pass
    return text.capitalize()

class Test(unittest.TestCase):

    def setUp(self): pass  # 테스트가 진행되기 전 실행되는 메소드
    def tearDown(self): pass  # 테스트가 마무리 된 후 실행되는 메소드

    def test1(self):
        text = "eyeballs"
        result = func(text)
        self.assertEqual(result, "Eyeballs")

    def test2(self):
        text = "hi eyeballs"
        result = func(text)
        self.assertEqual(result, "Hi eyeballs")

    def test3(self):
        text = 123
        result = func(text)
        self.assertEqual(result, 123)

    def test4(self):
        text = None
        result = func(text)
        self.assertEqual(result, None)

    def test5(self):
        text = True
        result = func(text)
        self.assertEqual(result, True)

if __name__ == '__main__' :
    unittest.main()

테스트 성공

C:\Users\EYE\Desktop\python>python mycode.py
.....
----------------------------------------------------------------------
Ran 5 tests in 0.001s

OK

Test failure (output after the expected values in test2 and test5 were changed to wrong ones)

C:\Users\EYE\Desktop\python>python mycode.py
.F..F
======================================================================
FAIL: test2 (__main__.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "mycode.py", line 24, in test2
    self.assertEqual(result, "Hi Eyeballs")
AssertionError: 'Hi eyeballs' != 'Hi Eyeballs'
- Hi eyeballs
?    ^
+ Hi Eyeballs
?    ^

======================================================================
FAIL: test5 (__main__.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "mycode.py", line 39, in test5
    self.assertEqual(result, None)
AssertionError: True != None

----------------------------------------------------------------------
Ran 5 tests in 0.003s

FAILED (failures=2)

로깅 모듈 사용하여 로깅할 때 필요한 개념들

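Message : the log message itself
Level : severity of the record: debug, info, warning, error, critical
Logger : the object that code calls to emit records; usually one per module, identified by name
Handler : delivers records to a destination such as the terminal, a file, or a DB
Formatter : turns a record into the final output string
Filter : decides, based on the record, whether it is passed on or dropped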

import logging
logging.debug("debug")
logging.info("info")
logging.warning("warning")
logging.error("error")
logging.critical("critical")

출력 결과
WARNING:root:warning
ERROR:root:error
CRITICAL:root:critical

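By default debug and info produce no output; only warning, error, and critical are printed
Setting the level to DEBUG through basicConfig prints everything from debug to critical
Note that basicConfig only takes effect while the root logger has no handlers, so call it before the first logging.debug()/info()/... call (or pass force=True on Python 3.8+)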

logging.basicConfig(level=logging.DEBUG)

출력 결과
DEBUG:root:debug
INFO:root:info
WARNING:root:warning
ERROR:root:error
CRITICAL:root:critical

로거를 설정해두면, 어떤 위치에서 어떤 로거에 의해 출력되었는지 따로 구분할 수 있음

import logging
logging.basicConfig(level=logging.DEBUG)

A_logger = logging.getLogger('A')
B_logger = logging.getLogger('B')

A_logger.warning("warning")
B_logger.critical("critical")

출력 결과
WARNING:A:warning
CRITICAL:B:critical

Adding filename to basicConfig writes the log to a file instead of stderr (the default destination)

import logging
logging.basicConfig(level=logging.DEBUG, filename='warning.log')
logging.warning("warning")

프로그램을 실행한 위치에 warning.log 가 생성됨

basicConfig 에 format 을 추가하면, 로그의 포맷을 변경할 수 있음

import logging
fmt = '%(asctime)s %(levelname)s %(lineno)s %(message)s'
logging.basicConfig(level=logging.DEBUG, format=fmt)
logging.warning("warning")

출력 결과
2025-01-30 21:24:58,157 WARNING 4 warning
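
basicConfig is a shortcut; the logger, handler, and formatter can also be wired up by hand. A sketch under assumptions: the logger name `demo.app` is made up, and an in-memory `StringIO` stands in for a file or terminal so the result can be inspected.

```python
import io
import logging

stream = io.StringIO()                    # stands in for a file or terminal
handler = logging.StreamHandler(stream)
handler.setLevel(logging.WARNING)         # the handler has its own threshold
handler.setFormatter(logging.Formatter("%(levelname)s|%(name)s|%(message)s"))

logger = logging.getLogger("demo.app")    # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.propagate = False                  # keep the root logger out of this sketch

logger.info("dropped by the handler")     # passes the logger, blocked by the handler
logger.error("kept")

output = stream.getvalue()
print(output)
```

A record has to pass both the logger level and the handler level before the formatter turns it into text.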

Python의 Global Interpreter Lock (GIL)
멀티스레드 환경에서 한 번에 하나의 스레드만 Python 바이트코드를 실행할 수 있도록 하는 메커니즘

Python 은 분명 멀티스레드를 지원하지만,
GIL 에 의해 동시에 두 개 이상의 스레드가 Python 코드 실행을 병렬로 수행할 수 없음

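The GIL prevents race conditions on interpreter state when several threads touch the same objects;
in particular it protects CPython's reference counts, which drive memory management, from being corrupted by concurrent updates
(reference counting is backed by a separate cycle-detecting GC that handles circular references)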

< CPU bound 작업 실행시 >

GIL은 멀티코어 CPU의 성능을 제대로 활용하지 못하게 만듦
실제로 멀티스레딩을 사용해도 실제 성능 향상이 거의 없음.
GIL이 있기 때문에 여러 개의 CPU 코어를 제대로 활용할 수 없음

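So for CPU-bound work, use multiprocessing instead of multithreading,
or push the heavy computation into libraries such as Numpy or Pandas, whose native code can release the GIL for many operations (PyPy also has a GIL; it speeds up code through its JIT, not by removing the lock)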

< IO bound 작업 실행시 >

GIL은 입출력(I/O) 중심 작업에서는 큰 영향을 주지 않음
Python은 GIL을 사용하지만, I/O 작업을 수행하는 동안에는 GIL을 자동으로 해제함.
따라서 멀티스레딩이 성능 향상에 도움이 될 수 있음.
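
A rough sketch of the I/O-bound case, with `time.sleep` standing in for a blocking network or disk call (an assumption for illustration; `fake_io` is a made-up name). Sleeping releases the GIL, so the four waits overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(seconds):
    time.sleep(seconds)   # the GIL is released while the thread is blocked
    return seconds


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_io, [0.2, 0.2, 0.2, 0.2]))
elapsed = time.perf_counter() - start

# Run one after another this would take about 0.8s; with threads it is close to 0.2s.
print(results, round(elapsed, 2))
```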

python 을 통해 새로운 process 를 실행하여 shell 명령어를 실행할 수 있음.
(shell 명령어를 실행한다는 의미는, process 를 하나 실행한다는 의미가 됨)

import subprocess
result = subprocess.getoutput('ls -laht | grep 2025 | wc -l') # shell 명령어를 수행 후 stdout/stderr 결과를 반환
result = subprocess.getstatusoutput('ls -laht') # shell 명령어 수행 후 stdout/stderr 결과와 상태 코드를 튜플로 반환
result = subprocess.call('date') # shell 명령어 수행 후 결과의 상태 코드(0(성공) 혹은 0 외의 값)만 반환
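result = subprocess.call(['date', '-u']) # command and arguments passed as a list; still returns only the exit status (the output goes straight to the terminal)
result = subprocess.call('date -u', shell=True) # a single string parsed by the shell; also returns only the exit status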

os 를 통해서도 shell 명령어 실행 가능

import os
result = os.system("ls -laht") # shell 명령어 수행 후 결과의 상태 코드(0(성공) 혹은 0 외의 값)만 반환
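
In current Python the usual entry point is `subprocess.run`, which returns the exit status and the captured output together. A sketch that runs a small Python one-liner as the child process (chosen here only so the example does not depend on a particular shell):

```python
import subprocess
import sys

completed = subprocess.run(
    [sys.executable, "-c", "print('hi eyeballs')"],  # command and arguments as a list
    capture_output=True,   # collect stdout and stderr instead of printing them
    text=True,             # decode bytes to str
    check=True,            # raise CalledProcessError on a non-zero exit status
)

print(completed.returncode, completed.stdout.strip())
```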

python multiprocessing 예제

import multiprocessing
import os

def target_func(name): # multiprocessing 으로 처리 될 함수
    print(f"process id : {os.getpid()} name : {name}")

def run_multi_process():
    target_func("main process")

    p1 = multiprocessing.Process(target=target_func, args=('first multiprocess',)) # args 는 튜플로 넣어줘야 함
    p1.start()
    print(f"p1) process id : {p1.pid} name : {p1.name}")

    p2 = multiprocessing.Process(target=target_func, args=('second multiprocess',))
    p2.start()
    print(f"p2) process id : {p2.pid} name : {p2.name}")

    p3 = multiprocessing.Process(target=target_func, args=('third multiprocess',))
    p3.start()
    print(f"p3) process id : {p3.pid} name : {p3.name}")

if __name__=='__main__':
    run_multi_process()

결과
process id : 16664 name : main process

p1) process id : 10564 name : Process-1
p2) process id : 11740 name : Process-2
p3) process id : 9108 name : Process-3

process id : 10564 name : first multiprocess
process id : 11740 name : second multiprocess
process id : 9108 name : third multiprocess

위에 p1.pid 와 (target_func 내의) os.getpid() 가 동일한 것을 확인할 수 있음
즉, target_func 내에서 실행되는 명령어들은 모두 sub process 에서 동작함

To stop a multiprocessing.Process forcibly, call terminate as below (to wait for a normal exit instead, use join())

p1.terminate()
p2.terminate()
p3.terminate()

python multi threading 예제

import threading
import os

def target_func(name):
    print(f"process id : {os.getpid()} name : {name}")

def run_multi_thread():
    target_func("main process")
    t1 = threading.Thread(target=target_func, args=('first thread',))
    t1.start()
    print(f"t1) thread ident : {t1.ident} native_id : {t1.native_id} name : {t1.name} ")

    t2 = threading.Thread(target=target_func, args=('second thread',))
    t2.start()
    print(f"t2) thread ident : {t2.ident} native_id : {t2.native_id} name : {t2.name}")

    t3 = threading.Thread(target=target_func, args=('third thread',))
    t3.start()
    print(f"t3) thread ident : {t3.ident} native_id : {t3.native_id} name : {t3.name}")

if __name__=='__main__':
    run_multi_thread()

결과

process id : 8900 name : main process

process id : 8900 name : first thread
t1) thread ident : 19272 native_id : 19272 name : Thread-1

process id : 8900 name : second thread
t2) thread ident : 13024 native_id : 13024 name : Thread-2

process id : 8900 name : third thread
t3) thread ident : 364 native_id : 364 name : Thread-3

process id 는 모두 동일한 것을 확인
ident 는 python 내부에서 관리하는 thread 의 고유 id 이며, os 레벨에서 관리할 수 없음
native_id 는 OS 가 관리하는 실제 thread 의 고유 id 이며, os 의 프로세스 관리도구(ps 등)로 확인 가능

index 를 추가하여 for 문을 도는 방법으로
for i in range(... 를 사용하는데,
enumerate 를 사용한다면, 굳이 range 를 사용하지 않아도 index 추가하여 for 문을 돌 수 있음
enumerate 를 사용하여 for 문을 돌면, 결과로 index 와 element 가 묶여져서 나옴

- for n in enumerate([321, 523, 447]): print(n)
결과)
(0, 321)
(1, 523)
(2, 447)

- for index, n in enumerate([321, 523, 447]): print(index, n)
결과)
0 321
1 523
2 447

- for n in enumerate([321, 523, 447], start = 7): print(n)
결과)
(7, 321)
(8, 523)
(9, 447)


[Scala] cheating sheet 정리

눈가락 2024. 8. 18. 16:41


Scala 언어 문법책 공부 후 핵심만 정리함

Scala 는 JVM 언어임. 자바 런타임을 사용하여 실행됨

Literal : 숫자 5, 문자 'A', 문자열 "eyeballs" 처럼 소스 코드에 바로 등장하는 데이터

값(value) : 불변의 타입을 갖는 저장 단위. 정의될 때 데이터가 할당되며 재할당 불가능

변수(variable) : 가변의 타입을 갖는 저장 단위. 정의될 때 데이터가 할당되며 재할당 가능

타입(type) : 데이터의 종류, 정의, 분류. Scala 의 모든 데이터는 특정 타입에 대응하며, 모든 Scala 타입은 그 데이터를 처리하는 메소드를 갖는 클래스로 정의됨

불변의 값(value) 를 사용하면, 다른 어떤 코드에서 접근하더라도 같은 값을 유지하는 안정성을 갖출 수 있어

코드를 읽고 디버깅하는 일이 더 쉬워짐

동시 또는 멀티 스레드 코드에서 값을 사용하여 에러 발생 가능성을 낮출 수 있음

이름	설명	크기	최솟값	최댓값
Byte	부호 있는 정수	1byte	-128	127
Short	부호 있는 정수	2byte	-32768	32767
Int	부호 있는 정수	4byte	-2^31	2^31 - 1
Long	부호 있는 정수	8byte	-2^63	2^63 - 1
Float	부호 있는 부동 소수	4byte	n/a	n/a
Double	부호 있는 부동 소수	8byte	n/a	n/a

Literal	Type	설명
5	Int	접두사/접미사 없는 정수 리터럴은 기본이 Int
0x0f	Int	접두사 0x : 16진수
5l	Long	접미사 l : Long 타입
5.0	Double	접두사/접미사 없는 소수 리터럴은 기본이 Double
5f	Float	접미사 f : Float 타입
5d	Double	접미사 d : Double 타입

정규 표현식

matches : 정규 표현식이 전체 문자열과 맞으면 true

"abc" matches ".*" : true

replaceAll : 일치하는 문자열을 모두 치환

"abc" replaceAll ("b|c","a") : "aaa"

replaceFirst : 첫번째로 일치하는 문자열을 치환

"abc" replaceFirst ("b|c","a") : "aac"

bool 비교 연산자인 &&,|| 은 게을러서 첫 번째 인수로 충분하다면 두 번째 인수는 평가하지 않음

&, | 은 결괏값 반환하기 전에 늘 두 인수를 모두 검사함

데이터 타입 연산

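Example	Description
val myVal = 5; myVal.asInstanceOf[Long]	Convert to the requested type
val myVal = 5; myVal.getClass	Return the type (= Class) of the value
val myVal = 5; myVal.isInstanceOf[Int]	Check whether the value is of the given type
val myVal = 5; myVal.hashCode	Return the hash code of the value
val myVal = 5; myVal.toDouble	Type conversion
val myVal = 5; myVal.toString	Convert the value to a string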

Tuple 은 선언 방법이 두 가지이며, 인덱스는 1부터 시작함

0부터 시작하지 않는 문제아. 아주 자기 멋대로야..

val myTuple = (1, "2", '3', "four", 5)

myTuple._1 // 1

myTuple._2 // "2"

val myTuple = 1 -> "2"

myTuple: (Int, String) = (1, "2")

표현식을 아래처럼 활용 가능

val result = { val x = 6*25; x*100 }

result: Int = 15000

여기서 x는 표현식 내에서만 사용 가능하며

표현식 바깥에서는 접근 불가

if ( true ) println("a")

결과 : a

if (1 > 2) println("a") else println("b")

결과 : b

if 문과 대체하여 사용 가능한 match 표현식은 아래처럼 사용 가능

val max = (1>2) match {

case true => 1

case false => 2

}

결과 : 2

val message = 500 match {

case 200 => "ok"

case 400 => { println("400"); "error" }

case 500 => { println("500"); "error" }

}

결과 : 500 이 출력되고 "error" 가 message 에 들어감

val kind = "WED" match {

case "MON" | "TUE" | "WED" | "THU" | "FRI" => "weekday"

case "SAT" | "SUN" => "weekend"

}

결과 : weekday

val ifstatement = -1 match {

case x if x > 0 => "plus"

case x if x < 0 => "minus"

case x => "zero"

}

결과 : minus

val other = 300 match {

case 200 => "ok"

case other => "error"

}

결과 : error

val wildcard = 300 match {

case 200 => "ok"

case _ => "error"

}

결과 : error

타입으로도 매칭이 가능

val x: Any = 123 // declared as Any so that both type patterns below compile

x match {

case x: String => "String"

case x: Int => "Int"

}

결과 : Int

for 루프는 일정 범위의 데이터를 반복하며, 반복할 때마다 표현식을 실행함

yield 를 추가하면 반환값들을 컬렉션으로 돌려줌

(각 원소마다 특정 표현식을 적용하고 반환하는 것이 마치 map 과 비슷함)

for ( i <- 1 to 5 ) { print(i+" ") }

결과 : 1 2 3 4 5

for ( i <- 1 until 5 ) { print(i+" ") }

결과 : 1 2 3 4

val result = for ( i <- 1 to 5 ) yield { i }
Result : result: scala.collection.immutable.IndexedSeq[Int] = Vector(1, 2, 3, 4, 5)

for ( i <- result ) { print(i+" ") }
결과 : 1 2 3 4 5

for ( x <- 1 to 2 ; y <- 1 to 3 ) { print(s"($x,$y)") }

Result : (1,1)(1,2)(1,3)(2,1)(2,2)(2,3)

아래와 같이 for문 안에 조건식을 넣을 수 있음

for ( i <- 1 to 10 if i%2!=0) print(i+" ")

결과 : 1 3 5 7 9

아래와 같이 여러개의 조건절을 넣을 수 있으며, 조건들은 서로 and 로 엮임

for {

c <- "a,bc,,d,,ef".split(",")

if c != null

if c.size > 1

} print(c+" ")

결과 : bc ef

아래와 같이 for 루프 안에서만 쓰이는 임시 변수를 지정하여 사용 가능

아래 예제의 x 는 for 루프가 반복 할 때마다 매번 정의되고 할당됨

for ( i <- 1 to 5 ; x = i * i ) print(x+" ")

결과 : 1 4 9 16 25

while 루프도 사용 가능함

var x = 5

while( x > 0 ) x-=1

결과 : x 는 0이 됨

순수 함수는 아래의 조건들을 만족하는 함수임

- 하나 이상의 입력 매개변수를 가짐

- 입력 매개변수만을 이용하여 계산 수행

- 값을 반환함

- 동일 입력에 대해 항상 같은 값을 반환

- 함수 외부의 어떤 데이터도 사용하거나 영향을 주지 않음

- 함수 외부 데이터에 영향을 받지 않음

순수 함수는 상태 정보를 유지하지 않으며, 외부 데이터에 관계없이 독립적임

본질적으로 순수함수는 변경될 수 없어서 안정적임

(마치 수학에서 사용하는 함수와 동일한 성질을 갖음)

Scala 프로그래밍을 하면서 순수 함수 비율을 많이 늘리는 게 좋다고 함

함수는 def 를 사용하여 선언하며, 선언문 다음에는 표현식이 옴

선언 : def hi = "hi"

사용 : hi

선언 : def hi = {"hi"}

사용 : hi

Declaration : def hi:String = {val result = "eyeballs"; result}

사용 : hi

결과 : eyeballs

선언 : def hi() = "hi"

사용 : hi 혹은 hi()

선언 : def hi(name:String) = "hi "+name

사용 : hi("eyeballs")

결과 : hi eyeballs

선언 : def hi(name:String):String = {

if (name != null) return "hi "+name

else return "it's null"

}

사용 : hi("eyeballs"), hi(null)

결과 : "hi eyeballs", "it's null"

선언 : def hi(name:String):String = {

if (name != null) return "hi "+name

"it's null"

}

사용 : hi("eyeballs"), hi(null)

결과 : "hi eyeballs", "it's null"

Declaration: def hi(name:String, ch:Char) = "hi "+name+ch

Usage: hi("eyeballs", '!')

Result: hi eyeballs!

Declaration: def hi(name:String, ch:Char) = "hi "+name+ch

Usage: hi(name = "eyeballs", ch = '!') , hi(ch = '!', name = "eyeballs")

Result: hi eyeballs!

Declaration: def hi(name:String, ch:Char = '!') = "hi "+name+ch

Usage: hi("eyeballs"), hi(name = "eyeballs")

Result: hi eyeballs!

Declaration: def hi(name:String)(ch:Char) = "hi "+name+ch

Usage: hi("eyeballs")('!')

Result: hi eyeballs!

When the number of arguments varies, * can be used to handle them, as below

Declaration: def sum(items: Int*): Int = {

var total = 0

for ( i <- items) total += i

total

}

Usage: sum(0), sum(1,2), sum(1,2,3,4,5)

Result: 0, 3, 15
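A related case the notes above do not cover: passing an existing collection to a varargs parameter. A minimal sketch, assuming a hypothetical sumAll function; the : _* ascription is the Scala 2 spelling, which Scala 3 still accepts unless compiled with the future source level.

```scala
// Varargs function: inside the body, items behaves like a Seq[Int].
def sumAll(items: Int*): Int = items.sum

val numbers = List(1, 2, 3, 4, 5)

// A List cannot be passed directly; `: _*` expands it into individual arguments.
val total = sumAll(numbers: _*)

// Calling with no arguments is also allowed.
val empty = sumAll()
```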

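An expression block can be passed as the argument at the call site, as below

In the example below, myname is a value used only for the duration of the call

Declaration: def hi(name:String) = "hi "+name

Usage: hi {val myname="eyeballs"; myname}

Result: hi eyeballs

A procedure is a function with no return value; its return type is Unit

For example, the two functions below are the same function

def log(d:Double) = println("value : "+d)

def log(d:Double):Unit = println("value : "+d)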

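Functions can be nested inside functions, as below

Scala tells functions apart by name and parameter list (overloading), so the three-parameter max and the two-parameter max in the example below are different functions

The two-parameter max can be used only inside the three-parameter max (not outside it)

Declaration: def max(a:Int, b:Int, c:Int) = {

def max(x:Int, y:Int) = if (x>y) x else y

max(a, max(b, c))

}

Usage: max(1,2,3)

Result: 3

As with generics, the type a function works with can itself be supplied at the call site

This is called a type parameter and is declared with [ ]

As in the last example, omitting the type parameter is not an error when it can be inferred

Declaration: def identity[TYPE] (x:TYPE):TYPE = x

Usage: identity[Int](1), identity[String]("eyeballs"), identity[Double](0.1), identity("eyeballs")

Result: 1, eyeballs, 0.1, eyeballs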

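A method of a class can be called with a dot, but whitespace (infix notation) also works

The three lines below all mean the same thing

"eyeballs".endsWith("s")

"eyeballs" endsWith("s")

"eyeballs" endsWith "s"

In Scala, functions are first-class objects

A first-class object is one that can be used anywhere in the language, just like an ordinary data type

A first-class function can be stored in a container such as a value or variable,

passed as a parameter to another function,

or returned from another function

A function that takes another function as a parameter

or returns another function is called a higher-order function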
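Both kinds of higher-order function described above can be sketched in a few lines. The names applyTwice, adder, and addThree are hypothetical and used only for illustration.

```scala
// Higher-order function that takes another function as a parameter.
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

// Higher-order function that returns another function.
def adder(n: Int): Int => Int = (x: Int) => x + n

val addThree = adder(3)                 // a function value stored in a val
val result = applyTwice(addThree, 10)   // (10 + 3) + 3
```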

Because functions are first-class objects, a function can be assigned to a value, as below

def hi = "hi"

val copyHi = hi

Usage: copyHi (hi has no parameter list, so it is evaluated here and copyHi holds the String "hi")

def hi() = "hi"

val copyHi = hi _

Usage: copyHi()

def hi(name:String) = "hi "+name

val copyHi: String => String = hi

Usage: copyHi("eyeballs")

def hi(name:String, ch:Char) = "hi "+name+ch

val copyHi = hi _

Usage: copyHi("eyeballs", '!')

def hi(name:String)(ch:Char) = "hi "+name+ch

val copyHi = hi _

Usage: copyHi("eyeballs")('!')

Because functions are first-class objects, a function can be passed as a parameter of another function, as below

def reverser(s: String) = s.reverse

def hi(name: String, f:String => String) = {

if(name!=null) f(name)

else name

}

Usage: hi("abcde", reverser), hi(null, reverser)

Result: edcba, null

A function literal (anonymous function) can be assigned directly to a value, as below

Declaration: val hi = (s: String) => s.reverse

Usage: hi("abc")

Declaration: def hi(name: String, f:String => String) = { if(name!=null) f(name) else name }

Usage: hi("abcde", s => s.reverse) , hi("abcde", (s:String) => s.reverse)

Declaration: val logging = () => "logging..."

Usage: println(logging())

However, a function literal cannot be used in a parameter declaration, as below

def hi(name:String, s=>s.reverse) = {....} // not allowed

def hi(name:String, (s:String)=>s.reverse) = {....} // not allowed

Placeholder syntax is a shorthand for function literals

It replaces the named parameters with the wildcard ( _ )

Placeholders can be used only when the explicit type of the function is specified outside the literal and each parameter is used no more than once

For example, it can be used like this

val double: Int => Int = _*2

Int => Int tells the compiler that the input is an Int,

and _ is used only once inside the function

Similar examples are shown below

Declaration: val rev: String => String = _.reverse

Usage: rev("eyeballs")

Declaration: def hi(name: String, f:String => String) = { if(name!=null) f(name) else name }

Regular usage: hi("abcde", s => s.reverse) , hi("abcde", (s:String) => s.reverse)

Placeholder usage: hi("abcde", _.reverse)

Declaration: def combination(x:Int, y:Int, f:(Int,Int)=>Int) = f(x,y)

Regular usage: combination(2,3, (x,y) => x*y)

Placeholder usage: combination(2,3, _ * _)

Declaration: def combination[A,B](x:A, y:A, f:(A,A)=>B) = f(x,y)

Regular usage: combination[Int,Double](2,3, (x,y) => x/y*1.0)

Placeholder usage: combination[Int,Double](2,3, _ / _ * 1.0)

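When a function is stored in a val, some of its parameters can be fixed first

This is called a partially applied function

For example, to fix one of the two parameters x and y of the function below,

leave a placeholder only where a parameter should still be accepted

def factorOf(x:Int, y:Int) = y%x == 0

val multipleOf3 = factorOf(3, _:Int)

Usage: factorOf(3, 10), multipleOf3(10)

A partially applied function can be created and stored in a val in the ways below

val f = factorOf _

val f = factorOf(_, _)

val f = factorOf(3, _) // in Scala 2 the placeholder needs a type here, as in the next line

val f = factorOf(3, _:Int)

def factorOf(x:Int)(y:Int) = y%x == 0

val multipleOf3 = factorOf(3) _

Usage: multipleOf3(4), multipleOf3 {val a=2; a*2}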
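One place where the curried form pays off is when the partially applied function is handed straight to a collection method. A minimal sketch reusing the curried factorOf from above; multiplesOf3 and multiplesOf4 are hypothetical names.

```scala
// Curried version: the first parameter list fixes the divisor.
def factorOf(x: Int)(y: Int): Boolean = y % x == 0

// filter expects an Int => Boolean, so factorOf(3) is converted to a function automatically.
val multiplesOf3 = (1 to 10).filter(factorOf(3))

// The same function can be reused with a different fixed parameter.
val multiplesOf4 = (1 to 10).filter(factorOf(4))
```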

When a function takes a by-name parameter (call by name),

the argument can be a literal value or a function call

For example, the doubles function below takes a by-name parameter

def doubles(x: => Int) = { print(s"got ${x} from doubles"); x*2 }

Literal values such as 2 or 5 can be passed to this doubles function

doubles(2) // result: got 2 from doubles, 4

doubles(5) // result: got 5 from doubles, 10

A function call can be passed as well

def f(a:Int) = {println(s"got ${a} from f"); a}

doubles(f(4)) // result: got 4 from f, got 4 from doubles, got 4 from f, 8

An f passed this way is called every time x is used inside doubles

That is why "from f" appears twice in the output above: x is used twice inside doubles
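The repeated evaluation described above can be made visible with a counter, and avoided by caching the argument in a local val. This is only a sketch; evaluations, nextValue, doubledByName, and doubledOnce are hypothetical names.

```scala
var evaluations = 0
def nextValue(): Int = { evaluations += 1; 21 }

// By-name parameter: x is re-evaluated at every use.
def doubledByName(x: => Int): Int = x + x

// Copying the by-name parameter into a val evaluates it only once.
def doubledOnce(x: => Int): Int = { val v = x; v + v }

val a = doubledByName(nextValue())  // nextValue runs twice
val countAfterA = evaluations
val b = doubledOnce(nextValue())    // nextValue runs once
val countAfterB = evaluations
```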

Adding case clauses lets a function handle specific parameter values

Declaration:

val caseFunc : Int=>String = {

case 1 => "one"

case -1 => "minus one"

case _ => "the others"

}

Usage: caseFunc(1), caseFunc(-1), caseFunc(3)
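When the case clauses do not cover every input (no case _ branch), the same syntax can be typed as a PartialFunction, which knows which inputs it handles. A minimal sketch; describe, knowsOne, knowsThree, and labels are hypothetical names.

```scala
// Only 1 and -1 are handled; there is deliberately no catch-all case.
val describe: PartialFunction[Int, String] = {
  case 1  => "one"
  case -1 => "minus one"
}

val knowsOne = describe.isDefinedAt(1)    // handled input
val knowsThree = describe.isDefinedAt(3)  // unhandled input

// collect applies the function only where it is defined and drops the rest.
val labels = List(1, 3, -1).collect(describe)
```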

A Scala class can be instantiated with new, just as in Java

Declaration: class User

Creation: val user = new User

Check: user.isInstanceOf[User] //true

user.isInstanceOf[AnyRef] //true

user.isInstanceOf[Any] //true

Declaration:

class User {

val name : String = "eyeballs"

def greet : String = s"hello ${name}"

override def toString = s"user name is ${name}"

}

Creation: val user = new User

Usage: user.greet, new User().greet

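Constructor parameters can be added as below; they are called class parameters

A class parameter n declared without val or var is not a public field, so it cannot be accessed from outside the class (user.n does not compile)

In the example below, the class parameter n is used only to initialize the field name

class User(n: String) {

val name : String = n

def greet : String = s"hello ${name}"

override def toString = s"user name is ${name}"

}

Putting val or var in front of the class parameter n

makes n a field of the class, so it can be used in methods and accessed from outside as well

class User(val n: String) {

def greet : String = s"hello ${n}"

override def toString = s"user name is ${n}"

}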

class A {

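override def toString = "A : "+getClass.getName // body assumed: the original note is cut off after "override"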

}

class B extends A

class C extends B {

override def toString = "C : "+getClass.getName

}

val a:A = new B

val b:B = new A // error: a parent instance cannot be assigned to a child type

Declaration:

class Car( val make: String, var reserved: Boolean ) {

def reserve(r: Boolean): Unit = {reserved = r}

}

Creation: val t = new Car("eyeballs company", false), val t = new Car(make = "eyeballs company", reserved = false)

Usage: t.reserve(true)

Declaration:

class Car( val make: String = "eyeballs company", var reserved: Boolean = true, val year: Int = 2024) {

def reserve(r: Boolean): Unit = {reserved = r}

}

Creation: val t = new Car(), val t = new Car(year=2025)

Declaration:

class myClass[A](element: A) {

val a:A = element

print(a.isInstanceOf[A])

}

Creation: every line below prints true (A is erased at runtime, so this check is unchecked and always succeeds)

val myClass = new myClass(1)

val myClass = new myClass[Int](1)

val myClass = new myClass("eyeballs")

val myClass = new myClass[String]("eyeballs")

An abstract class is declared with abstract

It cannot be instantiated itself; it exists only to be extended by other classes

Declaration:

abstract class Car {

val year: Int

val color: String

}

Usage:

class myCar extends Car {

val year = 2024

val color = "Red"

}

class myCar(val year:Int = 2024) extends Car {

val color = "Red"

}

An abstract class can also be used without declaring a subclass, by implementing its members at the moment of creation (an anonymous class)

Declaration: abstract class Car (val year:Int) { def show: Unit }

Creation: val myCar = new Car(2024) {

def show: Unit = { println(s"this car is ${year} years old") }

}

Usage: myCar.show

or

new Car(2024) { def show: Unit = { println(s"this car is ${year} years old") } }.show

Overloading inside a class is also possible

class MyClass {

def print(a:String) = println(a)

def print(a:Int) = println(a)

def print(a:String, b:Int) = println(a+" "+b)

}

If a method named apply is implemented, the instance can be called directly by its name, without writing the method name

Declaration:

class Multiple(factor: Int){

def apply (input: Int): Int = input * factor

}

Creation: val triple = new Multiple(3)

Usage: triple(7), triple.apply(7) // both return 21

Marking a field lazy makes it get created (evaluated) only when it is first used

In other words, a lazy field is initialized on first access

Declaration:

class Lazy {

val x = { println("now initial"); 1 }

lazy val y = { println("lazy initial"); 2 }

}

Creation:

val l = new Lazy() // "now initial" is printed here

println(l.y) // lazy val y is used here, so "lazy initial" is printed, followed by 2
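A lazy val is also evaluated at most once: later accesses reuse the stored value. The sketch below counts evaluations to show this; Settings, loads, and values are hypothetical names.

```scala
class Settings {
  var loads = 0  // counts how many times the initializer runs
  lazy val values: Map[String, String] = {
    loads += 1
    Map("mode" -> "test")
  }
}

val s = new Settings
val loadsBefore = s.loads     // nothing evaluated yet
val first = s.values("mode")  // first access runs the initializer
val second = s.values("mode") // second access reuses the stored value
val loadsAfter = s.loads
```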

By default, Scala adds no privacy controls

Anyone can instantiate any class we write

and access its fields and methods

Privacy controls can be added when wanted,

by putting protected or private in front of a field or method

A field or method marked protected is accessible only from the same class or its subclasses

Declaration:

class User { protected val password = "12345" }

class CheckUser extends User { def isValid = ! password.isEmpty }

Usage:

new User().password // error: not accessible

new CheckUser().isValid // result: true. The subclass CheckUser can access password of User

A field or method marked private is accessible only from the class that defines it

Declaration:

class User { private val password = "12345" }

Usage:

new User().password // error: not accessible

class CheckUser extends User { def isValid = ! password.isEmpty } // declaration error: not accessible from the subclass

Access can also be controlled at the package level

Declaration:

package com.eyeballs {

private[eyeballs] class Config { // accessible only inside the com.eyeballs package

val url = "eyeballs.tistory.com"

}

class Test { println(new Config().url) }

}

Usage:

new com.eyeballs.Test // result: eyeballs.tistory.com

new com.eyeballs.Config // error: Config is not accessible from packages outside com.eyeballs

final can be used to prevent a class from being subclassed,

or to prevent a subclass from overriding a field or method of its parent

final class A

class B extends A // error: A cannot be subclassed

class A { final val a = "a" }

class B extends A { override val a = "b" } // error: the parent field A.a is final and cannot be overridden

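object and case class, which resemble class but serve different purposes, are described below

An object is a kind of class that cannot have more than one instance

Think of it as a class with the singleton pattern applied

An object is not instantiated with the new keyword

Instead, the object is accessed directly by name

It is instantiated automatically (inside the JVM) the first time it is accessed

Because instantiation is automatic, an object takes no initialization parameters (its apply method can take parameters, though)

An object can extend another class

but another class cannot extend an object

The reason given is that the fields and methods of an object are globally accessible, so there is no need for a subclass

Declaration: object Hi { println("call Hi"); def hi = "hi" }

Usage: println(Hi.hi)

Result: call Hi, hi

When Hi.hi is used repeatedly, the instance created the first time is reused, so "call Hi" is printed only once

Being a singleton, an object is used for things such as pure functions, I/O functions that use a DB, or setting up a sparkSession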

An object with the same name as a class is called a companion object

A companion object can access the private and protected fields and methods of the class

Declaration:

class Multiplier(val x: Int) { def product(y:Int) = x*y }

object Multiplier { def apply(x: Int) = new Multiplier(x) }

Usage:

val tripler = Multiplier(3) // uses the object

val result = tripler.product(10) // result: 30
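The private access mentioned above is what makes the companion object a natural home for factory methods. A minimal sketch, assuming a hypothetical Token class whose constructor and field are both private; the class and its companion need to be defined together in the same file.

```scala
// The constructor and the field are private: only Token and its companion can use them.
class Token private (private val secret: String) {
  def length: Int = secret.length
}

object Token {
  // The companion may call the private constructor...
  def apply(raw: String): Token = new Token(raw.trim)
  // ...and read the private field.
  def reveal(t: Token): String = t.secret
}

val token = Token("  abc  ")
val revealed = Token.reveal(token)
```

Outside the class and its companion, new Token("x") and token.secret would both be rejected by the compiler.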

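A case class is a class that comes with several automatically generated methods

A case class also gets an automatically generated companion object, which has its own generated methods

Case classes are mainly used to store and transfer data

and tend not to be used for hierarchical class structures

because the generated methods do not take inherited fields into account

The automatically generated methods are as follows

Name	Location	Description
apply	object (companion object)	Factory method that instantiates the case class
copy	class	Returns a copy of the instance with the requested changes applied. Its parameters are the fields of the class, each defaulting to the current field value
equals	class	Returns true if every field of the other instance matches every field of this instance. Can also be called with the == operator
hashCode	class	Returns a hash code of the fields of the instance. Useful for hash-based collections
toString	class	Returns the class name and fields combined into a String
unapply	object (companion object)	Extracts the instance into a tuple of its fields so that case class instances can be used in pattern matching

Declaration: case class Character (name: String, age: Int)

Usage:

val a = Character ("AA", 2)

val b = a.copy(name="BB")

a == b // false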
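The generated methods in the table can be exercised together in a few lines. The sketch below uses a hypothetical Person case class (rather than Character, to avoid any confusion with java.lang.Character).

```scala
case class Person(name: String, age: Int)

val a = Person("AA", 2)               // apply: no `new` needed
val b = a.copy(name = "BB")           // copy: one field changed, the rest kept

val equalByValue = a == Person("AA", 2) // equals compares fields, not references
val asText = a.toString                 // generated toString

// unapply lets an instance be taken apart in a pattern match.
val summary = b match {
  case Person("AA", _)   => "found AA"
  case Person(name, age) => s"$name is $age"
}
```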

List. Once created, its contents cannot be changed

Internally it is implemented as a linked list

val myList = List()

val myList = List(1,2,3)

val myList = List("a", "b", 1, 2)

myList(0) // "a"

myList(1) // "b"

myList(-1) // error

myList(10) // error

myList.size // 4

myList.isEmpty // false

myList == Nil // false. Nil is the singleton instance of the empty list List()

Nil == List() // true

myList.head // "a"

val tailList = myList.tail // "b", 1, 2

for ( l <- myList ) { print(l+" ") } // a b 1 2

foreach takes a function and calls it with every element of the list

myList.foreach( l => print(l+" ") ) // a b 1 2

map takes a function that converts a single list element to another value or type

val newList = myList.map(l => "["+l+"]") // [a], [b], [1], [2]

reduce takes a function that combines two elements into one, picking elements two at a time from the front of the list

val combination = myList.reduce((a,b) => a+" "+b) // "a b 1 2"

Another way to create a list is to use ::

val myList = 1 :: 2 :: 3 :: Nil

val newList = 0 :: myList //0,1,2,3

or to join two lists with :::

val twoList = List(1,2) ::: List(3,4) // 1,2,3,4

++ appends another collection (a Set, for example) to a list

val twoCollections = List(1,2) ++ Set(3,3,3) // List(1,2,3)

:+ and +: add a single element to a List

List(1,2,3) :+ 4 // List(1,2,3,4). The linked list must be walked to its end, so this can be slow

1 +: List(2,3,4) // List(1,2,3,4)

== compares collections (lists, sets, etc.). It is true when both collections have the same type and contents

List(1,2) == List(1,3) // false

drop excludes the first n elements of a List

val droppedList = List(1,2,3,4) drop 2 //List(3,4)

dropRight excludes the last n elements of a List. The linked list must be walked to its last element, so this can be slow

val droppedList = List(1,2,3,4) dropRight 2 //List(1,2)

distinct removes duplicates from a List

List(1,2,1,2).distinct // 1, 2

filter keeps only the elements for which the predicate is true

val filteredList = List(1,2,3,4,5) filter (_>2) // 3,4,5

When a List contains nested Lists, flatten produces a single list holding all of the inner elements

List(List(1,2), List(3)).flatten //1,2,3

If a plain value that is not a List is mixed in, however, it is an error

List(List(1,2), List(3), 4).flatten // error

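partition builds two lists, one for the elements where the predicate is true and one where it is false (the result is a tuple)

val part = List(1,2,3,4,5) partition (_<3)

part._1 is List(1,2) // elements for which the predicate is true

part._2 is List(3,4,5) // elements for which the predicate is false

splitAt splits a List into left and right parts at the given index. The result is a tuple

val split = List(1,2,3,4) splitAt 2

split._1 is List(1,2)

split._2 is List(3,4)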

reverse reverses the order of the List elements

List(1,2,3).reverse // List(3,2,1)

slice returns only a segment of the List: the elements whose index satisfies from <= index < until

List(1,2,3,4,5) slice (0,0) // List()

List(1,2,3,4,5) slice (0,1) // List(1)

List(1,2,3,4,5) slice (0,2) // List(1,2)

List(1,2,3,4,5) slice (1,2) // List(2)

List(1,2,3,4,5) slice (2,1) // List()

take extracts only the first n elements of a List

List(1,2,3) take 2 // List(1,2)

takeRight extracts only the last n elements of a List. The linked list must be walked to its last element, so this can be slow

List(1,2,3) takeRight 2 // List(2,3)

sorted sorts the List elements in lexicographic or ascending order

List(3,2,1).sorted // List(1,2,3)

List('c','b','a').sorted // List('a','b','c')

sortBy sorts the List elements by a chosen key

List("abc","de","f") sortBy (_.size) // List("f", "de", "abc")

zip pairs up the elements of two Lists at the same index and returns a list of tuples

val z = List(1,2) zip List('a','b')

z is List( (1,a), (2,b) )

collect with case clauses keeps only the elements that match a case and transforms them as the case specifies

List("a", "b", "c") collect {

case "a" => "A"

}

Result: List("A")

List("a", "b", "c") collect {

case "a" => "A"

case "b" => "B"

case _ => "nothing"

}

Result: List("A", "B", "nothing")

map replaces every element of the List with the result of applying the given function to it

List("a", "b", "c").map(_.toUpperCase)

Result: List("A", "B", "C")

flatMap, like map, replaces each element of the List with the result of applying the given function,

but unlike map it flattens all of the results into a single List

List("a,b,c","d,e,f").flatMap(_.split(","))

Result: List("a","b","c","d","e","f")

List("a,b,c","d,e,f").map(_.split(","))

Result: List(Array("a","b","c"), Array("d","e","f"))

List(1,2,3).max // 3, the maximum

List(1,2,3).min // 1, the minimum

List(1,2,3).product // 6, the product of all elements

List(1,2,3).sum // 6, the sum of all elements

contains checks whether the List includes a given element

List(1,2,3) contains 2 // true

exists checks whether at least one element of the List satisfies the predicate

List(1,2,3).exists(_<2) // true

List(1,2,3).exists(_<1) // false

forall checks whether every element of the List satisfies the predicate

List(1,2,3).forall(_<2) // false

List(1,2,3).forall(_<=3) // true

startsWith checks whether the List begins with the elements of a given List

List(1,2,3) startsWith List(1) // true

List(1,2,3) startsWith List(1,2) // true

List(1,2,3) startsWith List(1,3) // false

endsWith checks whether the List ends with the elements of a given List

List(1,2,3) endsWith List(3) // true

List(1,2,3) endsWith List(2,3) // true

List(1,2,3) endsWith List(1,3) // false

From here on, functions that, like reduce, collapse the values of a List into one are described

foldLeft reduces the List from the left, starting with the given initial value

List(1,2,3).foldLeft(0)(_-_) // -6

Reason:

0 - 1 = -1

-1 - 2 = -3

-3 - 3 = -6

List(1,2,3).foldLeft(1)(_-_) // -5

List(1,2,3).foldLeft(2)(_-_) // -4

foldRight reduces the List from the right, starting with the given initial value

List(1,2,3).foldRight(0)(_-_) // 2

Reason:

3 - 0 = 3

2 - 3 = -1

1 - -1 = 2

List(1,2).foldRight(0)(_-_) // -1

Reason:

2 - 0 = 2

1 - 2 = -1

reduceLeft reduces the List from the left, starting with its first element

List(1,2,3).reduceLeft(_-_) // -4

Reason:

1 - 2 = -1

-1 - 3 = -4

reduceRight reduces the List from the right, starting with its last element

List(1,2,3).reduceRight(_-_) // 2

Reason:

2 - 3 = -1

1 - -1 = 2

scanLeft processes the List from the left, starting with the given initial value, and returns a List of every running result

List(1,2,3).scanLeft(0)(_-_) // List(0, -1, -3, -6)

Reason:

initial value = 0

0 - 1 = -1

-1 - 2 = -3

-3 - 3 = -6

scanRight processes the List from the right, starting with the given initial value, and returns a List of every running result

List(1,2,3).scanRight(0)(_-_) // List(2, -1, 3, 0)

Reason:

initial value = 0

3 - 0 = 3

2 - 3 = -1

1 - -1 = 2

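There is a difference between directional operations such as reduceLeft and reduceRight

and non-directional ones such as plain reduce

For instance, in an operation like the one below, where the accumulator type (Boolean) differs from the element type (Int),

the directional foldLeft and foldRight work,

but the non-directional fold does not, because its accumulator must be a supertype of the element type

List(1,2,3).foldLeft(false) {(a,b) => if(a) a else b==2} // true

List(1,2,3).foldLeft(false) {(a,b) => if(a) a else b==4} // false

List(1,2,3).foldRight(false) {(a,b) => if(b) b else a==2} // true

List(1,2,3).foldRight(false) {(a,b) => if(b) b else a==4} // false

List(1,2,3).fold(false) {(a,b) => if(a) a else b==2} // error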
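The type restriction behind the error above can be seen side by side in a short sketch: foldLeft may accumulate into any type, while fold works when the accumulator has the element type (or a supertype of it). The names words, totalLength, and joined are hypothetical.

```scala
val words = List("a", "bb", "ccc")

// foldLeft: the accumulator (Int) has a different type from the elements (String).
val totalLength = words.foldLeft(0)((acc, w) => acc + w.length)

// fold: the accumulator has the same type as the elements, so this compiles.
val joined = words.fold("")(_ + _)
```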

Set. Once created, its contents cannot be changed

val mySet = Set()

val mySet = Set(1,2,3)

val mySet = Set("a","b",1,1,2,2) //"a", "b", 1, 2

Map (like HashMap in Java or dictionary in Python). Its contents also cannot be changed after creation

val myMap = Map()

val myMap = Map(1->"a", 2->"b")

myMap(1) // "a"

myMap(2) // "b"

myMap.contains(1) // true

myMap.contains(3) // false

for ( pairs <- myMap ) {

val key = pairs._1

val value = pairs._2

println(key+" "+value)

}

Result: 1 a, 2 b
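Looking up a missing key with myMap(3) throws an exception, so the safer accessors are worth a short sketch alongside the lookups above. It also shows that "adding" to an immutable Map returns a new Map. The names found, missing, fallback, and updated are hypothetical.

```scala
val myMap = Map(1 -> "a", 2 -> "b")

val found = myMap.get(1)                // Option holding the value
val missing = myMap.get(3)              // None instead of an exception
val fallback = myMap.getOrElse(3, "?")  // default for a missing key

// + returns a new Map; the original is left unchanged.
val updated = myMap + (3 -> "c")
```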

Conversions between collections work as follows

mkString converts a collection to a String joined by a separator

List(1,2,3).mkString("-") //"1-2-3"

Set(1,2,3).mkString("-") //"1-2-3"

toBuffer converts a collection to a mutable Buffer

List(1,2,3).toBuffer // Buffer(1,2,3)

Map(1->1, 2->2, 3->3).toBuffer // Buffer((1,1), (2,2), (3,3))

toList converts a collection to an immutable List

Set(1,2,3).toList // List(1,2,3)

toMap converts a collection of tuples to a Map

Set((1,1), (2,2), (3,3)).toMap // Map(1->1, 2->2, 3->3)

toSet converts a collection to a Set

List(1,1,2,2).toSet // Set(1,2)

toString converts the type and contents of a collection to a String

List(1,2,3).toString // "List(1, 2, 3)"

Set(1,2,3).toString // "Set(1, 2, 3)"

Scala and Java both run on the JVM, but their collections are not directly compatible by default;

asJava and asScala make them interoperate

import collection.JavaConverters._ // deprecated since Scala 2.13 in favor of scala.jdk.CollectionConverters._

val scalaList = List(1,2,3)

// scalaList: List[Int] = List(1,2,3)

val javaList = scalaList.asJava

// javaList: java.util.List[Int] = [1, 2, 3]

val scalaListAgain = javaList.asScala

// scalaListAgain: scala.collection.mutable.Buffer[Int] = Buffer(1,2,3)
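On Scala 2.13 and later, the same round trip is usually written with scala.jdk.CollectionConverters, which provides the same asJava and asScala methods. A minimal sketch, assuming Scala 2.13 or newer; roundTrip is a hypothetical name.

```scala
import scala.jdk.CollectionConverters._

val scalaList = List(1, 2, 3)

// Scala List -> java.util.List (a wrapper view, not a copy).
val javaList: java.util.List[Int] = scalaList.asJava

// java.util.List -> mutable Buffer -> immutable Scala List.
val roundTrip: List[Int] = javaList.asScala.toList
```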

Examples of using a collection in a match expression

val myList = List(1,2,3)

myList(0) match {

case 1 => "A"

case _ => "B"

}

Result: "A"

myList(0) match {

case x if x > 0 => "A"

case _ => "B"

}

Result: "A"

myList match {

case x if x contains (2) => "A"

case _ => "B"

}

Result: "A"

myList match {

case List(1,

What is the difference between call by name and call by ref?


Study English 24.07.03-05

눈가락 2024. 7. 6. 15:46


still remember I said I'was strong and wan't tired after swimming.

but, turns out, It was really wrong.

after finishing my work and dinner at 7:30, I don't have any energy to do something.

Well, I thought 1 hour between dinner and swimming class was enough to study English or Computer Science.

Now I just lay down on my sofa and watch youtube. thats all.

since Thursday, I didn't want to go to the swimming class....

9PM class ruins my schedule and sucks my energy a lot.

but while taking the class It's very fun to swim.

Thank God It's Friday. really.

during the weekend I don't need to take the class and make time to study English.

Now I'm at a cafe to review this weekend and rearrange my life's dirention.

this is the result chatgpt reviewed.

I still remember saying I was strong and wasn't tired after swimming.

It turns out I was really wrong.

After finishing my work and dinner at 7:30PM, I don't have any energy to do anything.

I thought one hour between dinner and my swimming class would be enough to study English or Computer Science.

But now, I just lay down on my sofa and watch youtube. That's all.

Since Thursday, I haven't wanted to go to the swimming class.

The 9PM class ruins my schedule and drains my energy.

But, while taking the class, It's very fun to swim.

Thank God It's Friday. Really.

During the weekend, I don't need to take the class and can make time to study English.

Now, I'm at a cafe to review this weekend and rearrange my life's direction.
