Data Science Notes (Data Preparation)


돌공공돌

2022-1

오로시 2022. 6. 6. 02:35

1. Handling duplicates

When managing data, we should count the number of unique identifiers.

Otherwise, we run into problems with duplicate records.

table(df_id1$ID)[table(df_id1$ID)>1] : finds and prints every ID that appears two or more times, with its count

names(table(df_id1$ID)[table(df_id1$ID)>1]) : prints only the ID values that appear two or more times
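A minimal sketch of the duplicate check above. The names df_id1 and ID follow the notes, but the data values and the score column are made up for illustration. Keeping the first row per ID with duplicated() is one possible cleanup, not necessarily the right one for every dataset.

```r
# Hypothetical data: df_id1 and ID follow the notes, the values are invented.
df_id1 <- data.frame(
  ID    = c("a01", "a02", "a02", "a03", "a03", "a03"),
  score = c(10, 20, 20, 30, 31, 32)
)

id_counts <- table(df_id1$ID)                 # occurrences of each ID
n_unique  <- length(unique(df_id1$ID))        # number of distinct IDs
dup_ids   <- names(id_counts[id_counts > 1])  # IDs that appear more than once

# Inspect every row that belongs to a duplicated ID
dup_rows <- df_id1[df_id1$ID %in% dup_ids, ]

# One possible fix (an assumption): keep only the first row for each ID
df_dedup <- df_id1[!duplicated(df_id1$ID), ]

print(n_unique)
print(dup_ids)
print(df_dedup)
```

If n_unique is smaller than nrow(df_id1), at least one ID is duplicated.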

2. Handling missing values

Dropping: is.na() (detects NA), na.omit(df), complete.cases(df)

Dropping? --> loses information

Replacing or interpolating? --> distorts information
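A small sketch of both options, using a made-up data frame df. Mean imputation is shown only as one example of replacing; the notes do not say which replacement method to use.

```r
# Hypothetical data frame with missing values in both columns
df <- data.frame(
  x = c(1, NA, 3, 4),
  y = c("a", "b", NA, "d")
)

# Count missing values per column before deciding what to do
na_per_col <- colSums(is.na(df))

# Dropping: both lines keep only rows without any NA (information is lost)
df_drop1 <- na.omit(df)
df_drop2 <- df[complete.cases(df), ]

# Replacing: fill NA in x with the mean of the observed values.
# This keeps every row but shrinks the variance of x (information is distorted).
df_imp <- df
df_imp$x[is.na(df_imp$x)] <- mean(df$x, na.rm = TRUE)

print(na_per_col)
print(df_drop1)
print(df_imp)
```

Here dropping keeps 2 of the 4 rows, while imputation keeps all 4 but inserts a value that was never observed.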

3. Outliers

summary()

hist()

boxplot()

rnorm(n, mean, sd)
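The functions above can be tried together on simulated data. This is only a sketch: the sample size, mean, sd, and the planted value 120 are arbitrary choices, and boxplot.stats() is added here as one way to list the points that boxplot() draws as outliers (the 1.5 * IQR rule).

```r
set.seed(1)  # make the simulated sample reproducible

# 100 draws from a normal distribution plus one planted extreme value
x <- c(rnorm(100, mean = 50, sd = 5), 120)

summary(x)   # a max far from the 3rd quartile hints at an outlier
hist(x)      # an isolated bar far to the right
boxplot(x)   # points beyond the whiskers are drawn individually

# Values beyond 1.5 * IQR from the hinges, the same rule boxplot() uses
out <- boxplot.stats(x)$out
print(out)
```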
