Wednesday, May 22, 2013

Perl: garbled characters at the end of a Chinese string

Question:

http://i.imgur.com/VUHWREJ.jpg

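The last character of the string has been cut off.

I want to remove the garbage left at the end. How can I do that?

So far I have tried the approach below, but the end of the string is still garbled: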

use Encode;

# $str holds the string shown in the image above

Encode::from_to($str,'UTF-8','UTF-8');

print $str; # the result is still the same
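
One plausible reason the round trip changes nothing: with its default settings, Encode replaces each malformed byte sequence with U+FFFD while decoding, and encoding back to UTF-8 then writes that replacement character out as the bytes EF BF BD. A minimal sketch, where the truncated sample bytes are an assumption standing in for the string in the image:

```perl
use strict;
use warnings;
use Encode;

# Assumed sample: "歡" (E6 AD A1) followed by the first two bytes of "迎" (E8 BF 8E),
# i.e. a UTF-8 byte string cut in the middle of a character.
my $str = "\xE6\xAD\xA1\xE8\xBF";

Encode::from_to($str, 'UTF-8', 'UTF-8');

# Hex dump of the result: the broken tail comes back as EF BF BD (U+FFFD),
# which is what shows up on screen as the garbage character.
print join(' ', map { sprintf '%02X', ord } split //, $str), "\n";
```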

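Answer:

Convert the string to unicode first, filter out the special character \x{fffd} (the Unicode replacement character), then convert it back to UTF-8. That does the job: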

Encode::from_to($str,'UTF-8','unicode');
$str =~ s/\x{fffd}//g; # the "g" after the last forward slash stands for "global": it tells Perl to replace all matches, not just the first one ( http://www.regular-expressions.info/perl.html )
Encode::from_to($str,'unicode','UTF-8');
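
The same idea can also be written with Encode's decode and encode functions instead of two from_to calls. This is only a sketch, not the method used above: the helper name strip_replacement_chars and the sample bytes are made up for illustration. Note that it also removes any U+FFFD that was genuinely part of the input.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes a UTF-8 byte string and returns it
# with the broken sequences removed.
sub strip_replacement_chars {
    my ($bytes) = @_;
    # With the default CHECK value, malformed sequences are decoded as U+FFFD.
    my $chars = decode('UTF-8', $bytes);
    $chars =~ s/\x{FFFD}//g;
    return encode('UTF-8', $chars);
}

# Assumed sample: one complete 3-byte character plus a truncated one.
my $clean = strip_replacement_chars("\xE6\xAD\xA1\xE8\xBF");
print length($clean), "\n";   # 3: only the bytes of the first character remain
```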


Other references:
http://stackoverflow.com/questions/6234386/how-do-i-sanitize-invalid-utf-8-in-perl
(the second answer did not work)
http://www.fileformat.info/info/unicode/char/fffd/index.htm
(for looking up the Unicode code point of that garbage character)
http://stackoverflow.com/questions/1016910/how-can-i-strip-invalid-xml-characters-from-strings-in-perl
http://www.perlmonks.org/?node_id=931058
(a similar solution)

New method:
http://www.ichiayi.com/wiki/tech/check_utf8



  
#!/usr/bin/perl
use strict;
use warnings;

# Return the longest leading part of $text (a byte string) that is well-formed UTF-8.
sub strip_non_utf8_characters {
    my $text = shift;
    my $utf8_rgx = qr/\A(?:
       [\x09\x0A\x0D\x20-\x7E]            # ASCII
     | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
     |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
     | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
     |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
     |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
     | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
     |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*\z/x;
    my $tlen = length($text);
    print "\n length:", $tlen;            # debug output
    # Drop one byte at a time from the end until the remainder is valid UTF-8.
    for (my $i = 0; $i < $tlen; $i++) {
        $text = substr($text, 0, $tlen - $i);
        return $text if $text =~ $utf8_rgx;
    }
    return '';
}

# Test helper: cut the string at every second byte offset and show the cleaned result.
sub t {
    my $text = shift;
    for (my $i = 0; $i < length($text); $i += 2) {
        printf("split length=%d response:%s\n",
            $i,
            strip_non_utf8_characters(substr($text, 0, $i))
        );
    }
}

# No "use utf8" here, so the literal is a byte string (3 bytes per character).
my $string = "歡迎來到全世界最大的網站";
#t($string);
print "\n", substr($string, 0, 10), "\n";   # 10 bytes: cut in the middle of a character
print length $string;
print "\n", strip_non_utf8_characters(substr($string, 0, 999)); # result
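
As an alternative to the hand-written regex, Encode itself can report where valid UTF-8 stops. The sketch below relies on Encode::FB_QUIET, which is documented to stop decoding at the first malformed sequence and return what was decoded so far. The name valid_utf8_prefix and the sample bytes are hypothetical. For a string that was simply truncated at the end, it should give the same result as the function above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: longest leading part of $bytes that decodes as UTF-8.
sub valid_utf8_prefix {
    my ($bytes) = @_;   # work on a copy, because FB_QUIET modifies its argument
    my $chars = decode('UTF-8', $bytes, Encode::FB_QUIET);
    return encode('UTF-8', $chars);
}

# Assumed sample: two complete 3-byte characters plus one stray lead byte.
my $cut = "\xE6\xAD\xA1\xE8\xBF\x8E\xE4";
print length(valid_utf8_prefix($cut)), "\n";   # 6
```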

  
