http://i.imgur.com/VUHWREJ.jpg
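The last character of the string got cut off.
I want to remove the garbage bytes at the end. How can I do that?
So far I have tried the method below, but the garbage at the end of the string is still there: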
use Encode;
# $str holds the string shown in the image above
Encode::from_to($str,'UTF-8','UTF-8');
print $str; # the result is still the same
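A likely reason the attempt above changes nothing visible: with the default CHECK value, Encode does not drop malformed bytes when decoding but substitutes U+FFFD (the replacement character), and re-encoding to UTF-8 writes that character back out as the bytes EF BF BD. The sketch below illustrates this with a made-up input (one complete 3-byte character followed by a truncated one), not the actual string from the image.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Illustrative input: a complete 3-byte sequence (E6 AD A1) followed by
# the first two bytes of another 3-byte sequence (E8 BF), i.e. a cut-off tail.
my $str = "\xE6\xAD\xA1\xE8\xBF";

# Same call as in the question: decode as UTF-8, re-encode as UTF-8.
from_to($str, 'UTF-8', 'UTF-8');

# The complete character survives unchanged; the malformed tail has been
# replaced by the UTF-8 encoding of U+FFFD instead of being removed.
my $has_replacement = index($str, "\xEF\xBF\xBD") >= 0 ? 1 : 0;
print "contains U+FFFD: $has_replacement\n";
```

So the string is valid UTF-8 afterwards, but the garbage is still displayed, now as the replacement character.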
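Answer:
Convert the string to unicode first, filter out the special character \x{fffd} (the replacement character that stands in for the malformed bytes), then convert it back to UTF-8. That does the trick: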
Encode::from_to($str,'UTF-8','unicode');
$str =~ s/\x{fffd}//g; # the trailing "g" stands for "global": it tells Perl to replace all matches, not just the first one (http://www.regular-expressions.info/perl.html)
Encode::from_to($str,'unicode','UTF-8');
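The same idea can also be written with decode and encode instead of two from_to calls, which keeps the substitution working on a character string rather than on encoded bytes. This is only a sketch under assumptions: the helper name strip_replacement_chars is made up for illustration, and the input is assumed to be a UTF-8 byte string. Note that it also removes any U+FFFD that was legitimately present in the input.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes a UTF-8 byte string, returns a UTF-8 byte string.
sub strip_replacement_chars {
    my ($bytes) = @_;

    # With the default CHECK value, malformed bytes decode to U+FFFD.
    my $chars = decode('UTF-8', $bytes);

    # Remove every replacement character ("g" = all matches).
    $chars =~ s/\x{FFFD}//g;

    return encode('UTF-8', $chars);
}

# Illustrative input: one complete 3-byte character plus a truncated one.
my $clean = strip_replacement_chars("\xE6\xAD\xA1\xE8\xBF");
print length($clean), "\n";   # byte length of the cleaned string
```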
Other references:
http://stackoverflow.com/questions/6234386/how-do-i-sanitize-invalid-utf-8-in-perl
(the second answer did not work)
http://www.fileformat.info/info/unicode/char/fffd/index.htm
(for looking up the Unicode code point of the garbage character)
http://stackoverflow.com/questions/1016910/how-can-i-strip-invalid-xml-characters-from-strings-in-perl
http://www.perlmonks.org/?node_id=931058
(a similar solution)
New method:
http://www.ichiayi.com/wiki/tech/check_utf8
#!/usr/bin/perl

# Return the longest prefix of $text (a byte string) that is valid UTF-8.
sub strip_non_utf8_characters {
    my $text = shift;
    # Single-quoted so the \x escapes reach the regex engine untouched;
    # the comments inside are allowed because the match uses the /x flag.
    my $utf8_rgx = '\A(
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*\z';
    my $tlen = length($text);
    print "\n length:", $tlen;
    # Drop one more byte from the end on each pass until the rest is valid.
    for (my $i = 0; $i < $tlen; $i++) {
        $text = substr($text, 0, $tlen - $i);
        return $text if ($text =~ m/$utf8_rgx/x);
    }
    return '';
}

# Test helper: cut the string at every even byte offset and show the result.
sub t {
    my $text = shift;
    for (my $i = 0; $i < length($text); $i += 2) {
        printf("split length=%d response:%s\n", $i, &strip_non_utf8_characters(substr($text, 0, $i)));
    }
}

$string = "歡迎來到全世界最大的網站";
#&t($string);
print "\n", substr($string, 0, 10), "\n";   # cut in the middle of a character
print length $string;
print "\n", &strip_non_utf8_characters(substr($string, 0, 999));
# Result
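For comparison, a shorter way to get a similar "longest valid prefix" result is to let Encode do the validation: when decode is called with the FB_QUIET check value, it stops at the first malformed sequence instead of substituting U+FFFD. This is a sketch, not the linked page's method; the helper name valid_utf8_prefix is made up, and its validity rules are Encode's strict UTF-8 rules, which differ slightly from the regex above (for example, the regex also rejects most ASCII control characters).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: keep the longest valid UTF-8 prefix of a byte string.
sub valid_utf8_prefix {
    my ($bytes) = @_;   # local copy, because decode may modify its argument

    # With FB_QUIET, decoding stops at the first malformed sequence and
    # returns only the characters decoded up to that point.
    my $chars = decode('UTF-8', $bytes, Encode::FB_QUIET);

    return encode('UTF-8', $chars);
}

# Illustrative input: one complete 3-byte character plus a truncated one.
my $prefix = valid_utf8_prefix("\xE6\xAD\xA1\xE8\xBF");
print length($prefix), "\n";   # byte length of the valid prefix
```

Unlike the regex loop, this does not retry once per dropped byte, so it stays a single pass over long strings.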