pyspark
# create an app named linuxhint
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students = [{'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
            {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
            {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
            {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
            {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'}]

# create the dataframe
df = spark.createDataFrame(students)

# two ways to create a dataframe that contains null values:
# 1) put None directly into the source data
data_with_null = [{'a': None, 'b': 2}, {'a': 1, 'b': None}]
df = spark.createDataFrame(data_with_null)

# 2) or add lit(None) as a new column
from pyspark.sql.types import StringType
from pyspark.sql.functions import lit
df = df.withColumn("null_val", lit(None).cast(StringType()))

https://stackoverflow.com/questions/33038686/add-an-empty-column-to-spark-dataframe

To replace the null values in a column that contains them, use the fill/fillna functions or the coalesce function; see the links below and the short sketch after this code block.

https://sparkbyexamples.com/pyspark/pyspark-fillna-fill-replace-null-values/
https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.coalesce.html

# sample data with consecutive numbers from 1 (inclusive) to 10 (exclusive)
df = spark.range(1, 10).toDF("nums")

https://linuxhint.com/sum-pyspark/

# create timestamp columns
from pyspark.sql.functions import current_timestamp, col, to_utc_timestamp

df = df.withColumn("current_timestamp", current_timestamp()) \
       .withColumn("to_utc_timestamp", to_utc_timestamp(col("current_timestamp"), "Asia/Seoul"))
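A minimal sketch of the null-replacement functions mentioned above, reusing the data_with_null list from the example; the replacement values (0, -1) and the derived column name a_or_b are arbitrary choices for illustration.

from pyspark.sql.functions import coalesce, col

df_nulls = spark.createDataFrame(data_with_null)

# fillna and na.fill are aliases; a dict gives a per-column replacement value
df_filled = df_nulls.fillna(0)
df_filled = df_nulls.na.fill({'a': 0, 'b': -1})

# coalesce keeps the first non-null value per row, e.g. fall back from 'a' to 'b'
df_coalesced = df_nulls.withColumn('a_or_b', coalesce(col('a'), col('b')))

The range example above links to a post on summing a column; for reference, summing nums over 1..9 gives 45:

from pyspark.sql.functions import sum as sum_

df_nums = spark.range(1, 10).toDF("nums")
df_nums.select(sum_("nums")).show()   # sum(nums) = 45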
scala spark
// record type for the sample rows (field names assumed from the deequ README example)
case class Item(
  id: Long,
  productName: String,
  description: String,
  priority: String,
  numViews: Long)

val rdd = spark.sparkContext.parallelize(Seq(
  Item(1, "Thingy A", "awesome thing.", "high", 0),
  Item(2, "Thingy B", "available at http://thingb.com", null, 0),
  Item(3, null, null, "low", 5),
  Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10),
  Item(5, "Thingy E", null, "high", 12)))

val data = spark.createDataFrame(rdd)

https://github.com/awslabs/deequ