What does advanced use of sed as an ETL tool look like? Many newcomers are unsure, so this article walks through it in detail.
In my view, once you have pushed sed far enough, the hardest questions you run into are these:
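How fast is sed when replacing text across a million lines?
How can sed, used as an ETL tool, be connected to MySQL, Oracle and similar databases for interactive work?
Can sed fail, and if so how should that be handled, for example when processing a million records breaks down?
And all of this is only the beginning!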
sed 's/pattern/replacement/' inputfile
This is the classic usage.
In practice, though, it does not behave quite the way we might expect:
[root@centos00 _data]# cat hw.txt
this is the profession tool on the professional platform
this is the man on the earth
[root@centos00 _data]# sed 's/the/a/' hw.txt
this is a profession tool on the professional platform
this is a man on the earth
[root@centos00 _data]#
Although we specified a pattern, only the first match on each line was replaced.
That is why the s command accepts flags:
s/pattern/replacement/flag
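A number N: replace only the Nth match of the pattern on each line;
g: replace every match of the pattern on the line;
p: print the pattern space if a substitution was made (usually combined with -n);
w filename: write the lines on which a substitution was made to the given file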
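The numeric and p flags are easy to try out. A minimal sketch, assuming GNU sed and feeding the two lines of hw.txt through a pipe; the variable names are illustrative and not part of the original session:

```shell
# The two lines of hw.txt from the session above
input='this is the profession tool on the professional platform
this is the man on the earth'

# Numeric flag: replace only the 2nd "the" on each line
second_only=$(printf '%s\n' "$input" | sed 's/the/a/2')

# p flag with -n: print only the lines where a substitution happened
changed_only=$(printf '%s\n' "$input" | sed -n 's/platform/system/p')

printf '%s\n' "$second_only" "$changed_only"
```

Without -n, the p flag would print a changed line twice: once explicitly and once through the automatic print at the end of the cycle.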
Replace every match of the pattern:
[root@centos00 _data]# sed 's/the/a/g' hw.txt
this is a profession tool on a professional platform
this is a man on a earth
Write the result to another text file:
[root@centos00 _data]# sed 's/the/a/w dts.txt' hw.txt
this is a profession tool on the professional platform
this is a man on the earth
[root@centos00 _data]# cat dts.txt
this is a profession tool on the professional platform
this is a man on the earth
[root@centos00 _data]#
[root@centos00 _data]# sed 's!/bin/bash!/bin/csh!' /etc/passwd
root:x:0:0:root:/root:/bin/csh
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
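! can serve as the delimiter as well. Because / is also the path separator, using it as the delimiter forces you to escape every / in a path with \, which hurts readability.
@ works as a delimiter too: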
[root@centos00 _data]# sed 's@/bin/bash@/bin/csh@' /etc/passwd
root:x:0:0:root:/root:/bin/csh
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
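Which raises the question: how many characters can actually serve as the delimiter?
According to the official documentation, any single character can take the place of / (POSIX excludes backslash and newline); sed treats whatever character follows s as the delimiter: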
https://www.gnu.org/software/sed/manual/html_node/The-_0022s_0022-Command.html
[root@centos00 _data]# sed 's6a6the6g' dts.txt
this is the profession tool on the professionthel plthetform
this is the mthen on the etherth
[root@centos00 _data]#
See, it is true: the first character after the s command is taken as the delimiter.
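The choice of delimiter matters most when the pattern comes from a variable that holds a path. A small sketch, assuming the values contain no | characters; the variable names are made up for illustration:

```shell
old='/bin/bash'
new='/bin/csh'

# | is the delimiter, so the slashes inside the paths need no escaping
result=$(printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' | sed "s|$old|$new|")

printf '%s\n' "$result"   # root:x:0:0:root:/root:/bin/csh
```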
The following write-up digs a little deeper:
There are two levels of interpretation here: the shell, and sed.
In the shell, everything between single quotes is interpreted literally, except for single quotes themselves. You can effectively have a single quote between single quotes by writing '\'' (close single quote, one literal single quote, open single quote).
Sed uses basic regular expressions. In a BRE, in order to have them treated literally, the characters $.*[\]^ need to be quoted by preceding them by a backslash, except inside character sets ([…]). Letters, digits and (){}+?| must not be quoted (you can get away with quoting some of these in some implementations). The sequences \(, \), \n, and in some implementations \{, \}, \+, \?, \| and other backslash+alphanumerics have special meanings. You can get away with not quoting $^] in some positions in some implementations.
Furthermore, you need a backslash before / if it is to appear in the regex outside of bracket expressions. You can choose an alternative character as the delimiter by writing, e.g., s~/dir~/replacement~ or \~/dir~p; you'll need a backslash before the delimiter if you want to include it in the BRE. If you choose a character that has a special meaning in a BRE and you want to include it literally, you'll need three backslashes; I do not recommend this, as it may behave differently in some implementations.
In a nutshell, for sed 's/…/…/':
Write the regex between single quotes.
Use '\'' to end up with a single quote in the regex.
Put a backslash before $.*/[\]^ and only those characters (but not inside bracket expressions).
Inside a bracket expression, for - to be treated literally, make sure it is first or last ([abc-] or [-abc], not [a-bc]).
Inside a bracket expression, for ^ to be treated literally, make sure it is not first (use [abc^], not [^abc]).
To include ] in the list of characters matched by a bracket expression, make it the first character (or first after ^ for a negated set): []abc] or [^]abc] (not [abc]] nor [abc\]]).
In the replacement text:
& and \ need to be quoted by preceding them by a backslash, as do the delimiter (usually /) and newlines.
\ followed by a digit has a special meaning. \ followed by a letter has a special meaning (special characters) in some implementations, and \ followed by some other character means \c or c depending on the implementation.
With single quotes around the argument (sed 's/…/…/'), use '\'' to put a single quote in the replacement text.
If the regex or replacement text comes from a shell variable, remember that
The regex is a BRE, not a literal string.
In the regex, a newline needs to be expressed as \n (which will never match unless you have other sed code adding newline characters to the pattern space). But note that it won't work inside bracket expressions with some sed implementations.
In the replacement text, &, \ and newlines need to be quoted.
The delimiter needs to be quoted (but not inside bracket expressions).
Use double quotes for interpolation: sed -e "s/$BRE/$REPL/".
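The quoting rules above can be wrapped in two small helpers so that arbitrary literal strings are safe to interpolate. This is only a sketch under stated assumptions: escape_bre and escape_repl are hypothetical names, / is the delimiter of the final s command, GNU sed is used, and strings containing newlines are not handled.

```shell
# Hypothetical helpers: escape a literal string for use as a sed BRE
# or as sed replacement text. Embedded newlines are not handled.
escape_bre()  { printf '%s\n' "$1" | sed 's|[][\\/.*^$]|\\&|g'; }
escape_repl() { printf '%s\n' "$1" | sed 's|[\\/&]|\\&|g'; }

literal='1.5*2'   # should match literally, not as a regex
repl='A&B/C'      # should be inserted literally

result=$(printf '%s\n' 'price 1.5*2 and 1x5*2' |
  sed "s/$(escape_bre "$literal")/$(escape_repl "$repl")/g")

printf '%s\n' "$result"   # price A&B/C and 1x5*2
```

The helpers use | as their own delimiter so that / can sit in the bracket expression unescaped.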
Line addressing:
The first form is numeric addressing: an explicit line number such as 1, 2 or 4 identifies the line to act on:
[root@centos00 _data]# sed '1s6a6the6g' dts.txt
this is the profession tool on the professionthel plthetform
this is a man on the earth
[root@centos00 _data]# sed '2s6a6the6g' dts.txt
this is a profession tool on the professional platform
this is the mthen on the etherth
[root@centos00 _data]#
The second form uses a regular expression, which is of course more flexible:
[root@centos00 _data]# sed '/platform/s6a6the6g' dts.txt
this is the profession tool on the professionthel plthetform
this is a man on the earth
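Both address forms can also be combined into a range of the form addr1,addr2. The original text does not cover ranges, so treat the following as a supplementary sketch with made-up sample data:

```shell
input='line one
line two
line three
line four'

# Numeric range: apply the substitution to lines 2 through 3 only
by_range=$(printf '%s\n' "$input" | sed '2,3s/line/LINE/')

# Regex to end of input: from the first line matching "three" to the last line ($)
to_end=$(printf '%s\n' "$input" | sed '/three/,$s/line/LINE/')

printf '%s\n' "$by_range" "$to_end"
```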
Command execution (grouping several commands):
[root@centos00 _data]# sed '/platform/{
s6a6the6g
s6on6above6g
}' dts.txt
this is the professiabove tool above the professiabovethel plthetform
this is a man on the earth
[root@centos00 _data]# sed '/platform/
{s6a6the6g
s6on6above6g
}' dts.txt
sed: -e expression #1, char 11: unknown command: `
'
[root@centos00 _data]#
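Single commands were covered above, but applying several commands to the same line works a little differently. The placement of the braces matters: as the failed attempt above shows, the opening { has to sit on the same line as the address. As Capote put it, a single misplaced punctuation mark can change the meaning of a sentence, so take care here.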
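The grouped form can also be written on one line. A minimal sketch, assuming GNU sed and the two lines of dts.txt piped in; the variable names are illustrative:

```shell
input='this is a profession tool on the professional platform
this is a man on the earth'

# Multi-line form: { stays on the same line as the address
multi=$(printf '%s\n' "$input" | sed '/platform/{
s/a/the/g
s/on/above/g
}')

# One-line form: commands separated by semicolons, including one before }
single=$(printf '%s\n' "$input" | sed '/platform/{s/a/the/g;s/on/above/g;}')

printf '%s\n' "$multi"
```

Both forms give the same result; the semicolon before the closing brace is not required by GNU sed but keeps the one-liner closer to what other implementations accept.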
The official documentation has a section on how sed works, which I find quite interesting:
6.1 How sed Works
sed maintains two data buffers: the active pattern space, and the auxiliary hold space. Both are initially empty.
sed operates by performing the following cycle on each line of input: first, sed reads one line from the input stream, removes any trailing newline, and places it in the pattern space. Then commands are executed; each command can have an address associated to it: addresses are a kind of condition code, and a command is only executed if the condition is verified before the command is to be executed.
When the end of the script is reached, unless the -n option is in use, the contents of pattern space are printed out to the output stream, adding back the trailing newline if it was removed. Then the next cycle starts for the next input line.
Unless special commands (like ‘D’) are used, the pattern space is deleted between two cycles. The hold space, on the other hand, keeps its data between cycles (see commands ‘h’, ‘H’, ‘x’, ‘g’, ‘G’ to move data between both buffers).
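When sed processes text line by line, it sets up two buffers: the pattern space and the hold space.
The pattern space holds the current line with its trailing newline removed. Once the line has been processed, sed empties the pattern space and moves on to the next line. It is temporary storage: each new cycle clears whatever the pattern space held.
The hold space, by contrast, keeps its contents from one cycle to the next; it starts out empty and only changes when a command such as h, H or x writes to it.
The rest of this article gradually brings the pattern space and the hold space into play.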
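The automatic print at the end of each cycle is easy to observe. A small sketch with made-up input, contrasting the default behaviour with -n:

```shell
# Default: the pattern space is printed automatically at the end of each
# cycle, so an explicit p makes every line appear twice
doubled=$(printf 'a\nb\n' | sed 'p')

# With -n the automatic print is suppressed; only the explicit p prints
once=$(printf 'a\nb\n' | sed -n 'p')

printf '%s\n' "$doubled" "$once"
```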
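#### Multi-line commands
Searching for a pattern across a whole text file means dealing with more than one line at a time. The pattern may not sit on a single line: it may be split across two adjacent lines, or the search may need to treat the entire file as one string. Multi-line processing therefore becomes necessary.
A hard-coded multi-line example, written with n;n;…: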
[root@centos00 _data]# sed '{/professional/{n;d}}' dts.txt
this is a profession tool on the professional platform
this is a man on the earth
i like better man
[root@centos00 _data]#
This locates the line containing professional and deletes the line that follows it.
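The same idea on a self-contained input, where the result is easier to check. The sample lines are invented for illustration:

```shell
input='header
separator
body'

# n prints the current line and loads the next one; d then deletes that next line
trimmed=$(printf '%s\n' "$input" | sed '/header/{n;d;}')

printf '%s\n' "$trimmed"
```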
Here n; simply makes the targeting more flexible. Without n;, deleting the empty line with the pattern ^$ would remove every empty line in the file:
[root@centos00 _data]# sed '{/^$/d}' dts.txt
this is a profession tool on the professional platform
this is a man on the earth
i like better man
[root@centos00 _data]#
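A regular expression was used here, so a brief explanation:
A regular expression is a pattern-matching tool for filtering text.
On Linux there are two regular expression flavours:
BRE - Basic Regular Expressions
ERE - Extended Regular Expressions
sed uses the BRE engine by default, and only a fairly small part of it, which makes it very fast but limited in features;
gawk uses the ERE engine and is a heavyweight editing tool (in fact a programming language), so its expressions are richer, though it may be slower.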
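The practical difference shows up mostly in how much has to be backslash-escaped. A sketch assuming a sed that accepts -E for extended regular expressions, as GNU and BSD sed do:

```shell
# BRE (the default): groups and intervals are written with backslashes
bre=$(printf '%s\n' 'aaa bbb' | sed 's/\(a\{1,\}\) \(b\{1,\}\)/\2 \1/')

# ERE (-E): the same expression without the backslashes
ere=$(printf '%s\n' 'aaa bbb' | sed -E 's/(a+) (b+)/\2 \1/')

printf '%s\n' "$bre" "$ere"
```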
Anchor characters:
Start of line: ^
End of line: $
Empty line: ^$
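Each anchor can be tried on a three-line input whose middle line is empty. The sample data and variable names are illustrative:

```shell
input='alpha

beta'

no_blank=$(printf '%s\n' "$input" | sed '/^$/d')    # drop empty lines
prefixed=$(printf '%s\n' "$input" | sed 's/^/#/')   # insert at the start of each line
suffixed=$(printf '%s\n' "$input" | sed 's/$/;/')   # append at the end of each line

printf '%s\n' "$no_blank" "$prefixed" "$suffixed"
```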
Multi-line matching
[root@centos00 Documents]# sed '/first/{N;s/\n/ /;s/line/user/g}' MultiLine.txt
this is the header line
this is the first user this is the second user
this is the third line
this is the end
[root@centos00 Documents]# sed '/first/{N;s/\n/ /;s/first.*second/user/g}' MultiLine.txt
this is the header line
this is the user line
this is the third line
this is the end
[root@centos00 Documents]#
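In the first example, we find the line containing first, then append the next line to it (both now sit in the pattern space). The embedded newline (\n) is replaced with a space, otherwise the two lines would still print as two, and then every line is replaced with user.
The second example is more interesting: besides joining the two lines, it uses the . wildcard (as .*) to replace the whole stretch from first to second, which amounts to a search spanning two lines.
The search can of course span three lines: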
[root@centos00 Documents]# sed '/first/{N;N;s/\n/ /g;s/first.*third/user/g}' MultiLine.txt
this is the header line
this is the user line
this is the end
[root@centos00 Documents]#
Now imagine applying this to the entire file.
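One common way to do that is to loop over N until the last line, so the whole input sits in the pattern space before any substitution runs. This sketch uses a label and the b command, assumes GNU sed, and reuses the MultiLine.txt content as inline data; it is not taken from the original session:

```shell
input='this is the header line
this is the first line
this is the second line
this is the third line
this is the end'

# :a marks a label; N appends the next line; $!ba branches back to a on
# every line except the last, so the whole input ends up in the pattern space
joined=$(printf '%s\n' "$input" |
  sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g' -e 's/first.*third/user/')

printf '%s\n' "$joined"
```

Because the entire file is held in memory at once, this approach suits small and medium inputs better than very large ones.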
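Reversing the line order
Reversing the order of the lines in a text file takes two concepts:
The hold space
The negation command !
The hold space is an interesting idea. Like the pattern space, sed uses it to store temporary data; unlike the pattern space, its data lives longer, whereas the pattern space is cleared before the next line is read in. Data can also be exchanged between the two spaces.
The sed commands that work with the hold space: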
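Command  Description
h        Copy the pattern space to the hold space
H        Append the pattern space to the hold space
g        Copy the hold space to the pattern space
G        Append the hold space to the pattern space
x        Exchange the contents of the pattern space and the hold space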
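Two one-command scripts make the behaviour of the hold space visible. A minimal sketch with made-up input:

```shell
# G appends the (still empty) hold space to each line,
# so every line is followed by a blank line
spaced=$(printf 'a\nb\nc\n' | sed 'G')

# x swaps the two spaces: each cycle prints what the previous cycle
# left in the hold space, so the output lags one line behind the input
shifted=$(printf 'a\nb\nc\n' | sed 'x')

printf '%s\n' "$spaced" "$shifted"
```

In the second case the first output line is empty and the last input line is never printed, which shows that the hold space survives from one cycle to the next.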
Reverse the lines of a file:
[root@centos00 Documents]# cat seqnumber.txt
1
2
3
4
5
6
[root@centos00 Documents]# sed -n '{G;h;s/\n//g;$p}' seqnumber.txt
654321
[root@centos00 Documents]#
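In this example, G;h; are the commands that move data between the pattern space and the hold space.
Pay particular attention to the $ in $p. Every single-letter command can be preceded by an address, and $ addresses the last line of input. The ! negates an address, which has two effects: the command is not run on lines that match the address, and it is always run on the lines that do not.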
[root@centos00 Documents]# sed -n '{G;h;$p}' seqnumber.txt
6
5
4
3
2
1
[root@centos00 Documents]# sed -n '{1!G;h;$p}' seqnumber.txt
6
5
4
3
2
1
[root@centos00 Documents]#
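1!G means that G is skipped on the first line only. When the first line is read the hold space is still empty (the first result above ends with an empty line for that reason), so only h runs. Every other line runs G;h; in turn, and the last line additionally runs p.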
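The negation can be tried on its own and in a variant of the reversal that does not need -n. A short sketch with a three-line input; the variable names are illustrative:

```shell
# 2!d deletes every line except line 2
only_second=$(printf '1\n2\n3\n' | sed '2!d')

# Same reversal as above, but instead of -n and $p, every line
# except the last is deleted so only the final pattern space is printed
reversed=$(printf '1\n2\n3\n' | sed '1!G;h;$!d')

printf '%s\n' "$only_second" "$reversed"
```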
[address]b[label]
[address] is an address expression, and label is a marker that names a particular group of commands.
[root@centos00 Documents]# cat MultiLine.txt
this is the header line
this is the first line
this is the second line
this is the third line
this is the end
[root@centos00 Documents]# sed '{ /second/bchg;s/[ ]is[ ]/ was /g;:chg s/line/user/ }' MultiLine.txt
this was the header user
this was the first user
this is the second user
this was the third user
this was the end
[root@centos00 Documents]#
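Note that the commands still run in order, but a line matching the address jumps straight to the label and runs only the commands after it. In the code above, is is replaced with was only on lines that do not contain second, while every line has line replaced with user.
For readability, a space may be placed between b and the label, as in [address]b [label]: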
[root@centos00 Documents]# sed '{ /second/b chg;s/[ ]is[ ]/ was /g;:chg s/line/user/ }' MultiLine.txt
this was the header user
this was the first user
this is the second user
this was the third user
this was the end
[root@centos00 Documents]#
If no label follows the branch command, a matching line skips all remaining commands and goes straight to the end of the script, so nothing is done to it!
[root@centos00 Documents]# sed '{ /second/b;s/[ ]is[ ]/ was /g;:chg s/line/user/ }' MultiLine.txt
this was the header user
this was the first user
this is the second line
this was the third user
this was the end
[root@centos00 Documents]#
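A typical use of the label-less b is to leave certain lines untouched, for example comment lines. A sketch with invented input, using separate -e expressions so that it does not depend on how a given sed parses semicolons after b:

```shell
input='# foo stays
foo changes'

# Lines starting with # branch to the end of the script and are printed as-is;
# all other lines go on to the substitution
result=$(printf '%s\n' "$input" | sed -e '/^#/b' -e 's/foo/bar/')

printf '%s\n' "$result"
```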
A label does not have to come at the end; it can also be placed before the first command, in which case branching to it creates a loop:
[root@centos00 Documents]# echo 'this,is,a,header,line,' | sed ':rmc s/,/ / ; b rmc ;'
^C
[root@centos00 Documents]# echo 'this,is,a,header,line,' | sed ':rmc s/,/ / ; /,/b rmc ;'
this is a header line
[root@centos00 Documents]#
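To avoid an infinite loop, add a condition to the branch, for example whether anything still matches (here, whether a comma remains); that stops the loop once the work is done.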
[root@centos00 Documents]# cat sed_t.sed
{
s/second/sec/
t
s/[ ]is[ ]/ was /
;
}
[root@centos00 Documents]# sed -f sed_t.sed MultiLine.txt
this was the header line
this was the first line
this is the sec line
this was the third line
this was the end
[root@centos00 Documents]#
The test command t gives an if-then-else structure:
if
s/second/sec/
else
s/[ ]is[ ]/ was /
That is, the s/[ ]is[ ]/ was / substitution runs only if s/second/sec/ did not replace anything.
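t is written in the same style as b:
[address]t [label]
The difference is that the branch condition is not the address but the result of a preceding s/// substitution, conceptually:
[s/second/sec/]t [label]
That is the full form. The earlier example omitted the label, so t jumped to the end of the script and nothing further happened to that line.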
[root@centos00 Documents]# cat sed_t_header.sed
{
s/header/beginning/
t chg
s/line/user/
:chg
s/beginning/beginning header/
}
[root@centos00 Documents]# sed -f sed_t_header.sed MultiLine.txt
this is the beginning header line
this is the first user
this is the second user
this is the third user
this is the end
[root@centos00 Documents]#
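Note that the commands in a t script also run in order: the commands after :chg are applied to every line as well, they simply have no effect on lines that do not contain beginning.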
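Combined with a label placed before the substitution, t gives a loop that repeats until the substitution stops matching. The sketch below inserts thousands separators; group_digits is a hypothetical helper name, and the expression is a commonly used idiom rather than something from the original article:

```shell
# Hypothetical helper: insert a comma before each trailing group of three digits.
# The loop repeats for as long as the substitution keeps succeeding.
group_digits() {
  printf '%s\n' "$1" |
    sed -e ':a' -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/' -e 'ta'
}

grouped=$(group_digits 1234567)
printf '%s\n' "$grouped"   # 1,234,567
```

Unlike an unconditional b, the t branch is taken only after a successful substitution, so the loop ends by itself.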
[root@centos00 Documents]# echo 'the cat is sleeping in his hat' | sed 's/.at/"&"/g'
the "cat" is sleeping in his "hat"
[root@centos00 Documents]#
. stands for any single character, so both cat and hat match. & stands for the entire matched string, which is then wrapped in double quotes.
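Two more uses of &, shown as a small sketch with made-up input: wrapping every number in brackets, and writing \& when a literal ampersand is wanted in the replacement:

```shell
# & expands to whatever the pattern matched
bracketed=$(printf '%s\n' 'a1 b22' | sed 's/[0-9][0-9]*/[&]/g')

# \& is a literal ampersand rather than the matched text
ampersand=$(printf '%s\n' 'you and me' | sed 's/and/\&/')

printf '%s\n' "$bracketed" "$ampersand"
```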
[root@centos00 Documents]# sed 's/this\(.*line\)/that\1/;p;' -n MultiLine.txt
that is the header line
that is the first line
that is the second line
that is the third line
this is the end
[root@centos00 Documents]#
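The interesting part is that \1, \2, \3 and so on (up to \9) refer to the sub-strings captured by each \( \) group in the pattern. In the replacement, the text referenced by \1, \2, … is kept as it was, while the matched text outside those groups is replaced.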
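Back-references are most useful for reordering parts of a line. A minimal sketch with invented sample strings:

```shell
# Swap "Last, First" into "First Last"
swapped=$(printf '%s\n' 'Doe, John' | sed 's/\(.*\), \(.*\)/\2 \1/')

# Rewrite a year-month-day date as day/month/year
date_dmy=$(printf '%s\n' '2024-01-31' |
  sed 's/\([0-9]*\)-\([0-9]*\)-\([0-9]*\)/\3\/\2\/\1/')

printf '%s\n' "$swapped" "$date_dmy"
```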
Example:
Add a line number to every line:
[root@centos00 Documents]# cat MultiLine.txt
this is the header line
this is the first line
this is the second line
this is the third line
this is the end
[root@centos00 Documents]# sed ' = ' MultiLine.txt | sed 'N;s/\n//g'
1this is the header line
2this is the first line
3this is the second line
4this is the third line
5this is the end
6
7
[root@centos00 Documents]#
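In the output above the number runs straight into the text. A small variation, sketched here with inline sample data, replaces the newline with a space instead of removing it:

```shell
# The first sed prints each line number on its own line before the line itself;
# the second joins every number/line pair, separated by a single space
numbered=$(printf 'first\nsecond\n' | sed '=' | sed 'N;s/\n/ /')

printf '%s\n' "$numbered"
```

The bare 6 and 7 at the end of the session output presumably come from empty lines at the end of MultiLine.txt, which are numbered like any other line.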