R 정규표현식 - 2

설명에 들어가기 전에 결과를 확인해볼 수 있도록 regmatches라는 함수를 활용해보자.
이는 regexpr / gregexpr 함수와 함께 사용된다.
먼저 regexpr을 활용해 보자.

strings <- "A00-B99 I. 특정 감염성 및 기생충성 질환 Certain infectious and parasitic diseases, A00-B99"
pattern <- "B" # B라는 패턴을 찾는다.
m <- regexpr(pattern, strings) 
regmatches(strings, m)

## [1] "B"

다음은, gregexpr를 활용했을 때의 예이다.

m <- gregexpr(pattern, strings)
regmatches(strings, m)

## [[1]]
## [1] "B" "B"

regexpr의 경우 정의와 같이 첫번째 발견되는 인수만 찾아내고,
gregexpr의 경우는 pattern이 일치하는 모든 인수를 찾아낸다.

실제 정규표현식의 활용

다음의 내용을 코드와 한글명, 영문명으로 나누고자 한다.
그 전에 간단한 워밍업을 해보자.

strings <- "A00-B99 I. 특정 감염성 및 기생충성 질환 Certain infectious and parasitic diseases, A00-B99"
pattern <- "A00"
unlist(regmatches(strings, gregexpr(pattern, strings))) # 결과값이 list 구조이기 때문에 unlist 적용

## [1] "A00" "A00"

결과값은 보시다시피 앞의 ‘A00’ 그리고 뒤의 ‘A00’ 두 개의 결과값이 모두 검색된다.
만약 ‘A00-B99’를 추출하고 싶다면 어떻게 결과를 검색해야할까?

strings <- "A00-B99 I. 특정 감염성 및 기생충성 질환 Certain infectious and parasitic diseases, A00-B99"
pattern <- "^[A-Z][A-Z0-9-]+"
unlist(regmatches(strings, gregexpr(pattern, strings))) # 결과값이 list 구조이기 때문에 unlist 적용 

## [1] "A00-B99"

여기서 약간의 설명이 필요한데 대괄호’[]‘는 정규표현식에서 ‘집합’을 의미한다.
만약, [A-Z]라 하면 영어대문자 한글자의 집합이라고 할 수 있다. 앞에 존재하는 ^표기는 해당 문자로 시작하는 경우만을 의미한다. 즉 ^[A-Z]‘은 영문 대문자로 시작하는 문자 한글자만을 의미한다. 실제로 확인해보자.

unlist(regmatches("A00-B99", gregexpr("^[A-Z]", "A00-B99")))

## [1] "A"

unlist(regmatches("C00-D99", gregexpr("^[A-Z]", "C00-D99")))

## [1] "C"

만약 ^표기가 없다면 어떨까? 직접 실행해보자.

unlist(regmatches("A00-B99", gregexpr("[A-Z]", "A00-B99")))

## [1] "A" "B"

unlist(regmatches("C00-D99", gregexpr("[A-Z]", "C00-D99")))

## [1] "C" "D"

예상대로 모든 영문 대문자를 찾게 된다. 그렇다면 영문 대문자로 시작하면서 또 붙어 있는 여러 글자를 찾으려면 어떻게 해야할까?

unlist(regmatches("ABCDEFG1234567", gregexpr("[A-Z]+", "ABCDEFG1234567")))

## [1] "ABCDEFG"

+표기는 여러 글자를 의미한다. 다른 방법으로 {}을 소개하기로 한다.

unlist(regmatches("ABCDEFG1234567", gregexpr("^[A-Z]{1,2}", "ABCDEFG1234567"))) # 1~2글자

## [1] "AB"

unlist(regmatches("ABCDEFG1234567", gregexpr("^[A-Z]{1,3}", "ABCDEFG1234567"))) # 1~3글자 

## [1] "ABC"

unlist(regmatches("ABCDEFG1234567", gregexpr("^[A-Z]{1,4}", "ABCDEFG1234567"))) # 1~4글자

## [1] "ABCD"

unlist(regmatches("ABCDEFG1234567", gregexpr("^[A-Z]{1,}",  "ABCDEFG1234567"))) # 1이상의 글자

## [1] "ABCDEFG"

자 그럼 “A00-B99 I. 특정 감염성 및 기생충성 질환 Certain infectious and parasitic diseases, A00-B99”라는 문자열에서 “A00-B99”를 찾아내기 위한 나머지 방법을 생각해보자.
영문대문자로 시작하는 단어는 ^[A-Z]로 설정하였으니, “00-B99”를 추가적으로 찾아내야 한다.
이를 정규식으로 표현하면, [0-9A-Z-]+가 된다. 0-9는 숫자, A-Z는 영문대문자 ‘-‘는 dash를 의미한다.
이러한 글자가 들어가는 집합을 구성하고 여러 글자를 의미하는 +를 붙이면 이제 우리가 원하는 글자를 찾게 된다.

다시 한 번 확인해 보자.

strings <- "A00-B99 I. 특정 감염성 및 기생충성 질환 Certain infectious and parasitic diseases, A00-B99"
pattern <- "^[A-Z][A-Z0-9-]+"
unlist(regmatches(strings, gregexpr(pattern, strings))) # 결과값이 list 구조이기 때문에 unlist 적용 

## [1] "A00-B99"

to be continued…

stay with problems longer

R 정규표현식 - 2