Attachment_fu sanitize filename, Regex and Unicode gotcha

Attachment_fu sanitizes the filenames of uploads by replacing any funky character (anything other than 0-9, a-z, A-Z, underscore, period or hyphen) with an underscore. This is done by the private sanitize_filename method in attachment_fu.rb:
def sanitize_filename(filename)
  returning filename.strip do |name|
    # NOTE: File.basename doesn't work right with Windows paths on Unix
    # get only the filename, not the whole path
    name.gsub! /^.*(\\|\/)/, ''

    # Finally, replace all non alphanumeric, underscore or periods with underscore
    name.gsub! /[^\w\.\-]/, '_'
  end
end
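To see what the two substitutions do, here is a minimal standalone sketch that runs outside Rails. The method name sanitize_filename_sketch is made up for illustration, and plain gsub stands in for the returning helper, which comes from ActiveSupport. The input is kept ASCII-only so the result does not depend on how \w treats Unicode.

```ruby
# Hypothetical standalone version of the two substitutions (not attachment_fu's code).
def sanitize_filename_sketch(filename)
  name = filename.strip

  # Drop everything up to and including the last backslash or forward slash,
  # so both Windows and Unix style paths are reduced to the bare filename.
  name = name.gsub(/^.*(\\|\/)/, '')

  # Replace anything that is not a word character, period or hyphen with "_".
  name.gsub(/[^\w.\-]/, '_')
end

puts sanitize_filename_sketch('C:\Documents and Settings\me\my photo (1).jpg')
# => my_photo__1_.jpg
puts sanitize_filename_sketch('/tmp/uploads/report-final.pdf')
# => report-final.pdf
```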
The shortcut \w is described here as letter or digit; same as [0-9A-Za-z] (it also matches the underscore). However, since the Ruby regex engine has Unicode support (Rails sets $KCODE to 'u'), "letter" means any Unicode letter, so characters like 爱与希望 are left in place. This can be a problem if you pass a filename containing such characters to a Flash player: the player just won't play the file! A quick fix is to check explicitly for 0-9A-Za-z, which can be done by changing the function to:
def sanitize_filename(filename)
  returning filename.strip do |name|
    # NOTE: File.basename doesn't work right with Windows paths on Unix
    # get only the filename, not the whole path
    name.gsub!(/^.*(\\|\/)/, '')

    # Original line: name.gsub! /[^\w\.\-]/, '_'
    # Replace everything except ASCII letters, digits, underscore, period and
    # hyphen (so non-ASCII letters go too) with x. You don't want all _ :)
    name.gsub!(/[^0-9A-Za-z_.\-]/, 'x')
  end
end
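As a quick check of the explicit character class, here is a small sketch you can run in plain Ruby. The helper name ascii_only_filename is hypothetical, and it assumes Ruby 1.9 or later with UTF-8 strings, where each non-ASCII character is replaced by a single x. The sketch keeps ASCII letters, digits, underscore, period and hyphen.

```ruby
# Hypothetical helper: keep only ASCII letters, digits, "_", "." and "-";
# every other character (including non-ASCII letters) becomes "x".
def ascii_only_filename(name)
  name.gsub(/[^0-9A-Za-z_.\-]/, 'x')
end

puts ascii_only_filename('爱与希望.flv')
# => xxxx.flv
puts ascii_only_filename('my_clip-01.flv')
# => my_clip-01.flv
```

Because the class lists the allowed characters explicitly, the result does not depend on whether \w is treated as Unicode-aware.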
Finally, none of this matters if non-ASCII characters don't cause any issues on your site.
