{"id":1831,"date":"2025-02-07T19:30:18","date_gmt":"2025-02-07T19:30:18","guid":{"rendered":"https:\/\/cybersecurityinfocus.com\/?p=1831"},"modified":"2025-02-07T19:30:18","modified_gmt":"2025-02-07T19:30:18","slug":"attackers-hide-malicious-code-in-hugging-face-ai-model-pickle-files","status":"publish","type":"post","link":"https:\/\/cybersecurityinfocus.com\/?p=1831","title":{"rendered":"Attackers hide malicious code in Hugging Face AI model Pickle files"},"content":{"rendered":"<div>\n<div class=\"grid grid--cols-10@md grid--cols-8@lg article-column\">\n<div class=\"col-12 col-10@md col-6@lg col-start-3@lg\">\n<div class=\"article-column__content\">\n<div class=\"container\"><\/div>\n<p>Like all repositories of open-source software in recent years, AI model hosting platform Hugging Face has been abused by attackers to upload trojanized projects and assets with the goal of infecting unsuspecting users. The latest technique observed by researchers involves intentionally broken but poisoned Python object serialization files called Pickle files.<\/p>\n<p>Often described as the GitHub for machine learning, Hugging Face is the largest online hosting database for open-source AI models and other machine learning assets. In addition to hosting services, the platform provides collaboration features for developers to share their own apps, model transformations, and model fine-tunings.<\/p>\n<p>\u201cDuring RL research efforts, the team came upon two Hugging Face models containing malicious code that were not flagged as \u2018unsafe\u2019 by Hugging Face\u2019s security scanning mechanisms,\u201d researchers from security firm ReversingLabs wrote in <a href=\"https:\/\/www.reversinglabs.com\/blog\/rl-identifies-malware-ml-model-hosted-on-hugging-face\">a new report<\/a>. 
“RL has named this technique ‘nullifAI,’ because it involves evading existing protections in the AI community for an ML model.”

## To ban or not to ban, that is the pickle

While Hugging Face supports machine learning (ML) models in various formats, Pickle is among the most prevalent thanks to the popularity of PyTorch, a widely used ML library written in Python that relies on Pickle serialization and deserialization for its models. Pickle is Python’s standard module for object serialization, which means turning an object into a byte stream; the reverse process is deserialization. In Python terminology, the two operations are called pickling and unpickling.

Serialization and deserialization, especially of input from untrusted sources, have been the cause of many remote code execution vulnerabilities across programming languages. Accordingly, the [Python documentation for Pickle](https://docs.python.org/3/library/pickle.html) carries a big red warning: “It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.”

That poses a problem for an open platform like Hugging Face, where users openly share model data that others then have to unpickle. On one hand, this opens the door to abuse by ill-intentioned individuals who upload poisoned models; on the other, banning the format outright would be too restrictive given PyTorch’s popularity.
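The risk the documentation warns about takes only a few lines to demonstrate. In this harmless sketch, a hypothetical `Payload` class substitutes `print` for the `os.system` or socket calls an attacker would use; any callable returned by `__reduce__` is invoked during unpickling, before the caller ever sees the result:

```python
import pickle

class Payload:
    """Hypothetical malicious object: __reduce__ tells the unpickler
    which callable to invoke while reconstructing the object."""
    def __reduce__(self):
        # A real attacker would return something like (os.system, ("<command>",));
        # print stands in here to keep the demonstration harmless.
        return (print, ("this ran during unpickling",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # prints the message as a side effect
```

Note that merely calling `pickle.loads()` on attacker-controlled bytes is enough; no attribute of the reconstructed object ever needs to be touched.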
So Hugging Face chose a middle road: attempt to scan for and detect malicious Pickle files.

This is done with an open-source tool called [Picklescan](https://github.com/mmaitre314/picklescan), which essentially implements a blacklist of dangerous methods and objects that could be referenced in Pickle files, such as eval, exec, compile, and open.

However, researchers from security firm Checkmarx [recently showed](https://checkmarx.com/blog/free-hugs-what-to-be-wary-of-in-hugging-face-part-4/) that this blacklist approach is insufficient and cannot catch every abuse method. First, they demonstrated a bypass based on Bdb.run instead of exec, Bdb being a debugger built into Python. When that was reported and blocked, they found another bypass using an asyncio gadget that likewise relied only on built-in Python functionality.

## Bad pickles

The two malicious models found by ReversingLabs used a much simpler approach: they broke the format the scanning tool expects. The PyTorch model format is essentially a Pickle file inside a ZIP archive, but the attackers compressed theirs with 7-Zip (7z), so the default torch.load() function would fail to open them. This also caused Picklescan to miss them.

Once unpacked, the Pickle files turned out to have malicious Python code injected at the start, essentially breaking the byte stream. The rogue code, when executed, opened a platform-aware reverse shell that connected back to a hardcoded IP address.

That got the ReversingLabs researchers wondering: how would Picklescan behave if it encountered a Pickle file in a broken format?
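To make the question concrete, a blacklist scanner of this kind can be sketched with the standard library's pickletools module, which walks a Pickle stream opcode by opcode. This is a simplified stand-in for how such scanners operate, not Picklescan's actual implementation:

```python
import pickle
import pickletools

# Denylist of dangerous callables, mirroring the blacklist idea.
DENYLIST = {"eval", "exec", "compile", "open", "system"}

def scan(data: bytes) -> list[str]:
    """Flag string opcode arguments that name denylisted callables.
    Note: pickletools.genops raises on a corrupted stream, which is
    exactly the failure mode ReversingLabs went on to probe."""
    hits = []
    for opcode, arg, _pos in pickletools.genops(data):
        # GLOBAL carries "module name"; protocol 4+ pushes the module
        # and attribute as separate string opcodes before STACK_GLOBAL.
        if isinstance(arg, str) and arg.split() and arg.split()[-1] in DENYLIST:
            hits.append(arg)
    return hits
```

On a benign stream such as `pickle.dumps([1, 2, 3])` this returns an empty list; on a stream that references builtins.eval it flags the name without ever executing anything.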
To find out, they first created a malicious but valid file, which Picklescan correctly flagged as suspicious and warned about. They then created a file with the malicious code injected at the start but with an “X” (binunicode) Pickle opcode toward the end of the byte stream, breaking the stream before the normal 0x2E (STOP) opcode was reached.

Picklescan produced a parsing error when it encountered the X opcode, but it issued no warning about the suspicious functions earlier in the file, which, during actual deserialization, would already have been executed by the time the X opcode triggered the parsing error.

“The failure to detect the presence of a malicious function poses a serious problem for AI development organizations,” the researchers wrote. “Pickle file deserialization works in a different way from Pickle security scanning tools. Picklescan, for example, first validates Pickle files and, if they are validated, performs security scanning. Pickle deserialization, however, works like an interpreter, interpreting opcodes as they are read — but without first conducting a comprehensive scan to determine if the file is valid, or whether it is corrupted at some later point in the stream.”

## Avoid pickles from strangers

The developers of Picklescan were notified, and the tool was updated to identify threats in broken Pickle files without waiting for the file to be validated first. Nevertheless, organizations should remain wary of models from untrusted sources delivered as Pickle files, even if they have been scanned with tools such as Picklescan.
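The execute-before-validate gap described above can be reproduced with a short, harmless analogue (not the actual nullifAI samples): because the unpickler interprets opcodes as it reads them, a payload placed early in the stream runs even though the corrupted tail later raises an error. Here the corruption is an invalid opcode byte rather than the "X" opcode RL used, and `print` again stands in for the reverse shell:

```python
import pickle

class Payload:
    """Hypothetical stand-in for the injected code."""
    def __reduce__(self):
        return (print, ("payload ran before the stream broke",))

valid = pickle.dumps(Payload())
assert valid.endswith(b".")        # b"." is the STOP opcode (0x2E)

# Drop STOP and append an invalid opcode byte, corrupting the stream
# before its terminator, as in the broken files RL constructed.
broken = valid[:-1] + b"\x00"

try:
    pickle.loads(broken)           # prints the message, then raises
except pickle.UnpicklingError as err:
    print("parser failed only afterwards:", err)
```

The payload's side effect appears before the UnpicklingError, which is precisely why a scanner that gives up on the first parse error can miss live malicious code.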
Other bypasses are likely to be found in the future, because blacklists are never perfect.

“Our conclusion: Pickle files present a security risk when used on a collaborative platform where consuming data from untrusted sources is the basic part of the workflow,” the researchers wrote.