http://i.imgur.com/VUHWREJ.jpg
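The last character of the string got cut off.
I want to strip the garbage bytes left at the end. How can I do that?
So far I have tried the approach below, but the garbage at the end is still there.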
use Encode;
# $str holds the string shown in the image above
Encode::from_to($str,'UTF-8','UTF-8');
print $str; # the result is still the same
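Answer:
Convert the string to Unicode first, filter out the special character \x{fffd} (U+FFFD, the replacement character), then convert it back to UTF-8. That takes care of it.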
Encode::from_to($str,'UTF-8','unicode');
$str =~ s/\x{fffd}//g; # The trailing "g" stands for "global": it tells Perl to replace all matches, not just the first one ( http://www.regular-expressions.info/perl.html )
Encode::from_to($str,'unicode','UTF-8');
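The same idea can also be sketched with Encode's decode and encode called directly, which makes it explicit that the substitution runs on decoded characters rather than raw bytes. This is only a sketch under assumptions: the helper name strip_replacement_chars and the sample bytes are made up for illustration, and the input is assumed to be UTF-8 octets. Note that it also removes any U+FFFD that was legitimately present in the text.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the original post): drop malformed UTF-8
# by letting decode turn it into U+FFFD, then deleting U+FFFD.
sub strip_replacement_chars {
    my ($bytes) = @_;
    # With no CHECK argument, decode substitutes U+FFFD for malformed sequences.
    my $chars = decode('UTF-8', $bytes);
    # /g deletes every U+FFFD, not only the first one.
    $chars =~ s/\x{FFFD}//g;
    return encode('UTF-8', $chars);
}

# Sample data (assumption): "abc" followed by the first two bytes
# of a three-byte character, i.e. a character cut off at the end.
my $truncated = "abc\xE6\xAD";
my $clean = strip_replacement_chars($truncated);
print length($clean), "\n";
```

Working on a copy inside the helper leaves the caller's string untouched, unlike from_to, which converts in place.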
Other references:
http://stackoverflow.com/questions/6234386/how-do-i-sanitize-invalid-utf-8-in-perl
(the second answer did not work)
http://www.fileformat.info/info/unicode/char/fffd/index.htm
(looking up the Unicode code point of that garbage character)
http://stackoverflow.com/questions/1016910/how-can-i-strip-invalid-xml-characters-from-strings-in-perl
http://www.perlmonks.org/?node_id=931058
(a similar solution)
New method:
http://www.ichiayi.com/wiki/tech/check_utf8
#!/usr/bin/perl
sub strip_non_utf8_characters {
    my $text = shift;
    my $utf8_rgx = '\A(
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*\z';
    my $tlen = length($text);
    print "\n length:", $tlen;
    for (my $i = 0; $i < $tlen; $i++) {
        $text = substr($text, 0, $tlen - $i);
        return $text if ($text =~ m/$utf8_rgx/x);
    }
    return '';
}
sub t {
    my $text = shift;
    for (my $i = 0; $i < length($text); $i += 2) {
        printf("split length=%d response:%s\n", $i, &strip_non_utf8_characters(substr($text, 0, $i)));
    }
}
$string = "歡迎來到全世界最大的網站";
#&t($string);
print "\n", substr($string, 0, 10), "\n";
print length $string;
print "\n", &strip_non_utf8_characters(substr($string, 0, 999));
# Result
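As an alternative to the byte-level regex above, Encode's FB_QUIET check can find the valid prefix in one pass. This is a sketch under assumptions, not the method from the linked page: the helper name valid_utf8_prefix is hypothetical, and the input is assumed to be UTF-8 octets. Its behaviour also differs slightly: it stops at the first malformed byte anywhere in the string, so a bad byte in the middle discards everything after it, and it does not reject control characters the way the regex does.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the original post): keep the longest
# leading run of bytes that is well-formed UTF-8.
sub valid_utf8_prefix {
    my ($bytes) = @_;
    # With FB_QUIET, decode stops at the first malformed sequence and
    # returns what it has decoded so far. It overwrites its argument with
    # the unprocessed remainder, which is why we pass a local copy.
    my $chars = decode('UTF-8', $bytes, Encode::FB_QUIET());
    return encode('UTF-8', $chars);
}

# Sample data (assumption): one complete three-byte character followed
# by the first two bytes of another one.
my $truncated = "\xE6\xAD\xA1\xE8\xBF";
print length(valid_utf8_prefix($truncated)), "\n";
```

For a string that was simply cut at a byte offset, as in the substr calls above, the only malformed sequence is the partial character at the end, so both approaches should return the same prefix.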