Safeguarding Privacy with Synthetic Data - PC & ASSOCIATES CONSULTING

by Alys Woodward, Senior Director Analyst at Gartner

A major problem with AI development today is the burden of obtaining real-world data and labeling it. In fact, data availability was named one of the top five barriers to implementing generative AI (GenAI) in a Gartner survey of 644 organizations conducted in the fourth quarter of 2023.

Synthetic data can help solve this problem. With orders of magnitude less privacy risk than real data, synthetic data can open a range of opportunities to train machine learning models and analyze data that would not be available if real data were the only option.

However, it’s important to understand how synthetic data can overcome privacy, compliance and data anonymization challenges, as well as the issues impeding its widespread adoption.

Addressing privacy challenges

Synthetic data helps organizations address privacy challenges while training their AI, machine learning (ML), or computer vision (CV) models.

Synthetic data can bridge information silos by acting as a substitute for real data and not revealing sensitive information, such as personal details and intellectual property. Since synthetic datasets maintain statistical properties that closely resemble the original data, they can produce precise training and testing data that is crucial for model development.
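As a rough illustration of what "maintaining statistical properties" can mean, the sketch below fits a mean and standard deviation to a single numeric column and samples new values from a normal distribution with those parameters. The column values, the `synthesize_column` helper and the normality assumption are hypothetical and chosen only for illustration. Real synthetic data generators typically model the joint behavior of many columns rather than one column in isolation.

```python
import random
import statistics


def synthesize_column(real_values, n_samples, seed=0):
    """Sample synthetic values from a normal distribution fitted to real_values.

    Assumption for illustration: the column is roughly normally distributed.
    """
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n_samples)]


# Hypothetical "real" column, for example customer ages.
real_ages = [23, 35, 41, 29, 52, 38, 47, 31, 44, 36]
synthetic_ages = synthesize_column(real_ages, n_samples=5000)

# Compare summary statistics of the real and synthetic columns.
print("mean: ", round(statistics.mean(real_ages), 1), round(statistics.mean(synthetic_ages), 1))
print("stdev:", round(statistics.stdev(real_ages), 1), round(statistics.stdev(synthetic_ages), 1))
```

The synthetic values are new draws rather than copies of individual records, yet their mean and spread stay close to those of the original column.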

Training computer vision models often requires a large and diverse set of labeled data to build highly accurate models. Obtaining and using real data for this purpose can be challenging, especially when it involves personally identifiable information (PII).

Two common use cases that require PII are ID verification and advanced driver assistance systems (ADAS), which monitor movements and actions in the driver’s area. In these situations, synthetic data can be useful for generating a range of facial expressions, skin colors and textures, as well as additional objects like hats, masks, and sunglasses. ADAS also requires AI to be trained for low-light conditions, such as driving in the dark.

Mitigating challenges associated with data anonymization

Efforts to manually anonymize and de-identify datasets (that is, to remove information that links a data record to a specific individual) are often time-consuming, labor-intensive and error-prone.

Ultimately, this can delay projects and lengthen the iteration cycle for developing ML algorithms and models. Synthetic data can overcome many of these pitfalls by providing faster, cheaper and easier access to data that resembles the original source, is suitable for use and protects privacy.

Furthermore, if manually anonymized data is combined with other publicly available data sources, there’s a risk it could inadvertently reveal information that could lead to data re-identification, thus breaching data privacy. Leaders can use techniques such as differential privacy to ensure any synthetic data generated from real data is at very low risk of deanonymization.
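The re-identification risk described above can be sketched as a simple linkage between two tables. Everything in this example is hypothetical: the records, the names, the `link` helper and the choice of ZIP code, birth year and gender as quasi-identifiers are assumptions made for illustration, not data or methods from the survey.

```python
# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
anonymized = [
    {"zip": "10110", "birth_year": 1985, "gender": "F", "diagnosis": "asthma"},
    {"zip": "10110", "birth_year": 1985, "gender": "M", "diagnosis": "diabetes"},
    {"zip": "10250", "birth_year": 1972, "gender": "F", "diagnosis": "migraine"},
]

# Hypothetical public source (for example, a public register) that includes names.
public = [
    {"name": "A. Example", "zip": "10110", "birth_year": 1985, "gender": "F"},
    {"name": "B. Sample", "zip": "10250", "birth_year": 1972, "gender": "F"},
    {"name": "C. Placeholder", "zip": "10250", "birth_year": 1972, "gender": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")


def key(record):
    """Build the tuple of quasi-identifier values used to join the two sources."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(anonymized_rows, public_rows):
    """Return (name, diagnosis) pairs for rows that match exactly one public person."""
    names_by_key = {}
    for person in public_rows:
        names_by_key.setdefault(key(person), []).append(person["name"])

    matches = []
    for row in anonymized_rows:
        names = names_by_key.get(key(row), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], row["diagnosis"]))
    return matches


reidentified = link(anonymized, public)
print(reidentified)
```

Only the first record is re-identified here: the second has no match in the public table, and the third matches two people, so it stays ambiguous. Records with rare combinations of attributes are the ones most exposed to this kind of join.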

Challenges hindering widespread adoption

Creating a synthetic tabular dataset involves striking a balance between privacy and utility, ensuring the data remains useful and accurately represents the original dataset. If the utility is too high, privacy may be compromised, especially for unique or distinctive records, as the synthetic dataset could be matched with other data sources.

Conversely, methods to enhance privacy, such as disconnecting certain attributes or introducing ‘noise’ via differential privacy, can inherently diminish the dataset’s utility.
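The privacy and utility trade-off can be made concrete with the Laplace mechanism, a standard building block of differential privacy. The sketch below applies it to a single counting query rather than to a full synthetic data generator, and the count, the epsilon values and the helper names are assumptions chosen for illustration. A smaller epsilon means stronger privacy and more noise, so the released number is less accurate.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so the Laplace scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(true_count, epsilon, trials=2000, seed=0):
    """Average absolute error of the noisy count over repeated releases."""
    rng = random.Random(seed)
    errors = [
        abs(private_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    ]
    return sum(errors) / trials


TRUE_COUNT = 120  # hypothetical number of records matching some query
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: mean absolute error = {mean_abs_error(TRUE_COUNT, eps):.2f}")
```

The error grows as epsilon shrinks, which is the utility cost of stronger privacy that the text describes.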

Over decades of data management, the low quality of transaction data has been an ongoing challenge. For example, call center agents might fail to enter complete address data or customer information, and this missing data can prevent analysis. To counteract this, IT organizations needed to educate business users on how important good data quality is to both applications and analytics. “Garbage in means garbage out” was the commonly accepted principle.

However, this principle now shapes people’s attitudes toward synthetic data: they believe it must be inferior because it’s not real data, which delays adoption. In reality, synthetic data can be better than real data, not in how it represents the current world, but in how it can train AI models to work with the ideal or future world.

A synthetic dataset mirrors the original dataset. Therefore, if the original does not include unusual occurrences or “edge cases,” these won’t appear in the synthetic dataset either. This is particularly important for image and video synthetic data in areas like autonomous driving, where many hours of driving footage are used to train the AI. Unusual situations such as emergency vehicles, driving in snow or animals on the road therefore need to be created deliberately.
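One simple way to plan for missing edge cases is to count how often each required scenario appears in the real data and work out how many synthetic samples would be needed to reach a minimum. The scenario labels, the clip counts, the `synthetic_top_up` helper and the threshold of 50 samples are all hypothetical, and the sketch covers only the planning step, not the image or video generation itself.

```python
from collections import Counter


def synthetic_top_up(labels, required_scenarios, min_count):
    """Return how many synthetic samples to create per required scenario
    so that each one reaches at least min_count samples."""
    counts = Counter(labels)
    return {
        scenario: max(0, min_count - counts.get(scenario, 0))
        for scenario in required_scenarios
    }


# Hypothetical scenario labels for 1,000 real driving clips.
real_clips = ["clear_day"] * 900 + ["rain"] * 90 + ["snow"] * 10

plan = synthetic_top_up(
    real_clips,
    required_scenarios=["snow", "emergency_vehicle", "animal_on_road"],
    min_count=50,
)
print(plan)
```

Scenarios that never occur in the real footage need the full quota of generated samples, while rare ones only need topping up.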
