AI-powered archive that brings forgotten voices from history back to life.
Too many historical voices—especially from marginalized groups—are lost or buried in inaccessible archives. This project tackles the frustration of how history often reflects only the dominant narrative, leaving behind stories that could challenge, enrich, or even reshape our understanding of the past. We want to unlock those voices using AI to find, translate, and narrate them in compelling, accessible ways.
Funding is definitely part of it—digitizing, translating, and processing historical archives isn’t cheap—but it’s not the only issue.
First, many of these materials are fragmented, poorly cataloged, or written in obscure dialects and scripts that few researchers can decipher without significant effort. That creates a huge accessibility bottleneck.
Second, institutional inertia plays a role. Archives and academic institutions are often slow to adopt new technologies or methods, especially ones like AI that challenge traditional scholarship.
Third, there’s a lack of interdisciplinary collaboration. Historians, linguists, technologists, and data scientists don’t naturally form project teams, so the innovation potential gets stuck in silos.
Lastly, there’s a perception issue: these forgotten voices aren’t always seen as “valuable” by mainstream funders or audiences. The stories are there, but the incentive to unearth them hasn’t been strong enough—yet.
Best case? We build a platform that radically democratizes access to historical narratives. Imagine an AI system that can comb through thousands of handwritten letters, court records, or diaries in dozens of languages, extract stories, and bring them to life as interactive audio or visual experiences—accurately, respectfully, and in context.
The impact would be cultural, educational, and even political. Educators could use it to teach more inclusive histories. Communities could reclaim silenced parts of their heritage. Scholars would gain tools to analyze patterns and voices previously buried by time or neglect.
It could reshape how we relate to history—less as a static record, more as a living, evolving conversation that includes everyone, not just the loudest or most powerful.
With 10x the funding, we could turn a powerful prototype into a global, multilingual platform with massive reach and depth.
Here’s what we’d do differently:
1. Scale the Archive Network – We’d partner with more global institutions, especially in underrepresented regions (e.g., Africa, Southeast Asia, Indigenous communities). This means unlocking stories that have never been digitized, let alone studied.
2. Build Smarter AI Models – We’d train domain-specific AI on historical scripts, regional dialects, and non-standardized spellings—making it vastly better at interpreting messy, real-world archives.
3. Multimodal Outputs – With more funding, we’d expand beyond text—using voice synthesis, animation, and immersive experiences (AR/VR) to bring stories to life for broader audiences.
4. Open Access Tools for Researchers – We’d create APIs and interfaces tailored for historians, educators, and artists, not just coders or AI geeks.
5. Long-Term Preservation – We’d invest in ethical, decentralized storage to ensure recovered histories are not just discovered—but preserved and owned by the communities they came from.
The outcome? This wouldn’t just be a tool or archive. It would become a cultural infrastructure project—something that rewires how future generations access and understand our shared human story. Not just more inclusive history, but more alive history.
Execution will follow a phased, agile approach with a core interdisciplinary team and key institutional partners.
Phase 1 – Foundation & MVP (Months 1–6):
• Data Acquisition: We’ll partner with select libraries and archives to digitize and access pilot collections, especially marginalized sources (letters, journals, court transcripts).
• Core Tech Stack: Build and fine-tune AI models for OCR, transcription, and translation of low-quality historical texts.
• Prototype Interface: A searchable, browsable platform that outputs AI-enhanced narratives, with voice and contextual tags.
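The Phase 1 pipeline shape can be sketched as composable stages. The stage functions below are stand-ins, not our production models; a real build would swap in a fine-tuned OCR model (e.g., something Tesseract- or TrOCR-like) and a translation model, but the structure of "document flows through OCR, then translation, then tagging for human review" is the point:

```python
# Minimal sketch of the Phase 1 pipeline: OCR -> translation -> review tagging.
# All stage bodies are placeholders; only the pipeline structure is real.
from dataclasses import dataclass, field

@dataclass
class Document:
    image_path: str
    raw_text: str = ""
    translated: str = ""
    tags: list = field(default_factory=list)

def ocr_stage(doc: Document) -> Document:
    # Placeholder: a real OCR model would read the scan at doc.image_path.
    doc.raw_text = f"<ocr output for {doc.image_path}>"
    return doc

def translate_stage(doc: Document, target: str = "en") -> Document:
    # Placeholder: a real NMT model would translate doc.raw_text.
    doc.translated = f"<{target} translation of {doc.raw_text}>"
    return doc

def tag_stage(doc: Document) -> Document:
    # Every AI-generated narrative starts flagged for human-in-the-loop review.
    doc.tags.append("needs-human-review")
    return doc

def run_pipeline(image_path: str) -> Document:
    doc = Document(image_path)
    for stage in (ocr_stage, translate_stage, tag_stage):
        doc = stage(doc)
    return doc
```

Keeping each stage as an independent function lets us replace one model (say, OCR for a new script) without touching the rest of the chain.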
Phase 2 – Expansion & Storytelling (Months 7–18):
• Narrative Layer: Deploy natural language models to generate story summaries, timelines, and interactive content.
• Community Engagement: Work with cultural organizations and educators to co-create public exhibits, lesson plans, and interactive tools.
• Open API: Release a public-facing API so others can plug into our dataset and tools.
Phase 3 – Global Scaling (Months 19–36):
• Add more languages, partnerships, and AI training for specific scripts (e.g., Arabic calligraphy, Hanzi variants).
• Support user uploads from communities with undocumented archives (family letters, oral history transcripts).
• Secure sustainable infrastructure and governance for long-term preservation and accessibility.
Team:
• Project Lead (me): Strategy, partnerships, AI product vision.
• AI Engineer: OCR, NLP, and voice model training.
• Historian/Archivist: Source curation, context validation, and ethical review.
• UX Designer: Interface design focused on accessibility and storytelling.
• Community Liaison: Builds relationships with underrepresented groups and ensures culturally sensitive storytelling.
Outside help includes academic advisors, open-source contributors, and potentially cultural institutions willing to share closed archives under ethical use agreements.
This isn’t just a tech project—it’s part cultural restoration, part infrastructure-building, and part storytelling revolution.
Here’s a juicy operational detail: the AI models often fail hardest not on ancient languages, but on modern handwriting with historical quirks.
Most outsiders assume decoding 18th-century letters or medieval texts is the toughest part. In fact, a huge operational challenge comes from 19th- and early-20th-century documents—especially those written in cursive or semi-formal script by everyday people. These texts mix evolving spelling conventions, local slang, inconsistent grammar, and regional handwriting styles that confuse standard OCR and NLP systems.
To handle this, we’re building a crowdsourced handwriting style bank—a kind of living dataset where people can contribute labeled examples of their ancestors’ writing. This lets the AI “learn” styles unique to certain regions or time periods, dramatically improving accuracy and reducing hallucinations.
It’s a clever mix of machine learning and human history—something no outsider expects until they see the model trip over a 1932 grocery list from rural France.
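A minimal sketch of the style bank's shape: contributors submit labeled samples, and the bank clusters them by region and period so a recognizer can later be fine-tuned per style cluster. The field names and class names here are illustrative, not our actual schema:

```python
# Hypothetical schema for the crowdsourced handwriting style bank.
# Samples are grouped by (region, period) so each cluster can become
# a fine-tuning set for that handwriting style.
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class HandwritingSample:
    image_path: str
    transcription: str   # contributor-provided ground truth
    region: str          # e.g. "rural France"
    period: str          # e.g. "1930s"

class StyleBank:
    def __init__(self):
        self._by_style = defaultdict(list)

    def contribute(self, sample: HandwritingSample) -> None:
        self._by_style[(sample.region, sample.period)].append(sample)

    def training_set(self, region: str, period: str) -> list:
        """Samples for fine-tuning a recognizer on one style cluster."""
        return list(self._by_style[(region, period)])
```

The key design choice is that the bank is keyed by style cluster rather than by contributor, so one family's samples improve recognition for every document from that region and era.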
Broadly, we’re looking at a 2.5- to 3-year timeline, broken into strategic milestones:
⸻
Year 1 – Proof of Concept & Infrastructure Build
• Secure access to archives and finalize data agreements
• Build baseline AI pipeline (OCR, translation, summarization)
• Launch a basic, browsable demo with 1–2 curated collections
• Key risk: archival permissions and licensing delays—these can drag for months, depending on bureaucracy
⸻
Year 2 – Expansion & Public Engagement
• Train AI on more scripts/languages using custom datasets
• Deploy storytelling and community tools (interactive timelines, voice narration)
• Partner with educators and local museums to pilot use cases
• Risk factor: ensuring ethical handling of sensitive or contested narratives—requires time, consultation, and care
⸻
Year 3 – Scale, Sustain, and Open Up
• Open up public contribution tools (upload, annotate, tag)
• Expand global partnerships (especially in the Global South)
• Begin archiving and preserving user-submitted materials
• Move toward decentralized, open-source stewardship of the platform
• Hard deadline: If the grant includes public launch expectations (e.g., a demo for a cultural event or educational cycle), we’ll align major outputs with that—likely in Q2–Q3 of Year 3
⸻
The biggest wild cards are archive access speed, ethical review cycles, and AI model bottlenecks (e.g., hallucination issues or biases). We’re building with enough flexibility to reroute when needed—but we’re under no illusion that it’ll all run smoothly. Real history never does.
Yes—we’ve been quietly building the foundation for about 8 months, mostly as a side project among a few technologists, historians, and archivists.
What we’ve learned so far:
1. Old documents are dirty data nightmares. Even well-scanned images come with noise: faded ink, bleed-through, handwritten annotations, stamps over text. We’ve had to invest way more time in preprocessing than expected.
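The preprocessing problem can be illustrated with a toy binarization step. Production pipelines would use adaptive thresholding from an imaging library such as OpenCV; this pure-Python sketch (the threshold value and grid are made up) just shows the core move—pushing faint bleed-through to white while keeping confident front-side ink:

```python
# Toy binarization for a grayscale scan (0 = black ink, 255 = white paper).
# Faded back-side bleed-through shows up as mid-gray values; anything that
# isn't confidently dark gets whitened out.

def binarize(page, ink_threshold=100):
    """Keep strong (front-side) ink black; push bleed-through and paper to white."""
    return [[0 if px < ink_threshold else 255 for px in row] for row in page]

def ink_ratio(page):
    """Fraction of pixels classified as ink; useful for spotting over-cleaning."""
    flat = [px for row in page for px in row]
    return sum(1 for px in flat if px == 0) / len(flat)
```

In practice a single global threshold fails on unevenly faded pages, which is exactly why preprocessing took more time than we expected.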
2. Context matters way more than content. Early NLP summaries were technically accurate but missed cultural nuance or tone. A letter written by a 1910s domestic worker in Chile doesn’t read like a modern op-ed—it reads like survival. We’ve started experimenting with hybrid models that include historical context flags in the input stream.
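The context-flag experiment can be sketched as a simple prompt-conditioning step: metadata about the document is prepended as control tags so the summarizer conditions on era, place, and author background instead of reading the text as modern prose. The tag format below is our own hypothetical convention, not an established standard:

```python
# Sketch of "historical context flags": document metadata is serialized into
# bracketed tags and prepended to the model input. Tag names are illustrative.

def with_context_flags(text, *, year, place, author_role):
    """Prepend historical context tags to a document before summarization."""
    flags = f"[YEAR={year}] [PLACE={place}] [ROLE={author_role}]"
    return f"{flags}\n{text}"
```

For example, `with_context_flags(letter_text, year=1912, place="Valparaíso, Chile", author_role="domestic worker")` signals the model to read the letter in its own register rather than as a modern op-ed.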
3. People want to help. Once we explained the vision to a few niche online communities—genealogists, archivists, local historians—we got tons of support. Volunteers offered transcriptions, handwriting samples, even rare family documents. That kind of grassroots enthusiasm convinced us to bake in a strong participatory layer from the start.
4. Institutional resistance is real. Some archives see AI as a threat—either to control, to jobs, or to scholarly authority. We’re learning how to frame this as a complement, not a replacement.
In short: we’re not guessing. We’ve already scraped our knees on the hard parts—and we’re more confident than ever it’s worth pushing forward.
We’re not just uniquely qualified—we’re weirdly, stubbornly built for this exact problem. Here’s why:
1. Interdisciplinary DNA – Our team isn’t just techies dabbling in history or scholars outsourcing code. We’re a tight crew of AI engineers, digital humanists, linguists, and community archivists who actually speak each other’s languages. That’s rare—and crucial for a project that sits at the crossroads of culture and computation.
2. Track Record in Building Scrappy Systems – We’ve already built custom OCR pipelines, fine-tuned LLMs for noisy historical data, and stitched together multilingual datasets from scratch. We know how to make underfunded tech sing.
3. Deep Respect for the Subject Matter – This isn’t a novelty for us. We’re not here to automate history into clickbait. We’re obsessed with getting it right—which means culturally sensitive outputs, traceable sources, and human-in-the-loop review.
4. Global Reach + Local Roots – Between us, we speak 7+ languages, have connections with institutions in Latin America, Southeast Asia, and Eastern Europe, and collaborate with people from communities that aren’t usually on the tech world’s radar. We’re building this platform to serve them, not just Silicon Valley or academia.
5. We’ve Already Started – This isn’t theoretical. We’re already doing the work—on nights, weekends, and borrowed compute. The grant won’t fund a fantasy; it will accelerate a living project.
In short: this isn’t just a job for us—it’s a mission. And we’ve got the scars (and skills) to prove it.
The project idea came from a collision of two frustrations—one technical, one personal.
On the tech side, we were experimenting with AI for historical document transcription and realized how poorly it handled anything outside standardized, well-documented sources. Try running OCR on a 1920s Tagalog diary or a letter written in Yiddish cursive. The models fall apart. That got us asking: whose stories are being excluded by our tools?
Then the personal side hit. One teammate was digging into their grandmother’s handwritten war letters from Korea. Another found old records from a Jamaican ancestor, barely legible and ignored by historians. These weren’t just artifacts—they were full lives, silenced not by malice, but by poor tooling and institutional neglect.
So we asked: what if we could build something that didn’t just preserve these voices, but resurrected them—ethically, accurately, and accessibly? Not for profit. Not for clicks. Just because they deserve to be heard.
As for domain expertise—yes, we’ve got it:
• One of us has a PhD in archival studies with a focus on postcolonial documentation systems.
• Another has worked on low-resource NLP pipelines for underrepresented languages.
• And a third has led machine learning initiatives for museum digitization projects.
We didn’t fall into this by accident. We’ve been circling this problem for years. This project is just the form our obsession finally decided to take.
We previously built a custom OCR and NLP pipeline to extract, translate, and annotate over 100,000 handwritten letters from 19th-century immigrant communities—using minimal resources and open-source tools—which is now being used by two university archives. That project taught us how to work across messy data, limited funding, and institutional red tape to deliver something both technically solid and historically meaningful.
AI Development & Infrastructure – 40%
• Training/fine-tuning OCR and NLP models for historical data
• Cloud compute costs (GPU time, storage, APIs)
• Custom tooling for handwriting variation, dialect tagging, etc.
Archival Access & Licensing – 20%
• Fees for digitization, licensing rare or restricted archives
• Travel or remote coordination with smaller, underfunded collections
• Legal/ethical review for culturally sensitive content
Community Engagement & Content Creation – 15%
• Honoraria for community contributors (transcribers, narrators, advisors)
• Development of educational materials, storytelling outputs, and local exhibits
• Translation/localization efforts for multilingual access
UX & Platform Development – 15%
• Front-end interface for browsing, listening, and interacting with stories
• API and backend development for researcher access
• Accessibility features (e.g., screen-reader compatibility, voice output)
Project Management & Overhead – 10%
• Coordination, compliance, and reporting
• Basic admin costs, part-time PM or ops support
• Contingency buffer for unexpected costs
⸻
If we get more than baseline funding, we’d scale up compute power, add more languages, and build a more immersive storytelling layer (e.g., voice cloning with historical accents). But even at base level, this budget lets us deliver a functioning, impactful system.
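As a sanity check, the category percentages above sum to 100, so dollar allocations follow mechanically from whatever the final grant total is. The $500,000 total in the usage note is purely illustrative, not a requested amount:

```python
# Budget split from the section above; allocate() turns any grant total into
# per-category dollar amounts and asserts the percentages are consistent.

BUDGET = {
    "AI Development & Infrastructure": 40,
    "Archival Access & Licensing": 20,
    "Community Engagement & Content Creation": 15,
    "UX & Platform Development": 15,
    "Project Management & Overhead": 10,
}

def allocate(total_usd):
    assert sum(BUDGET.values()) == 100, "budget percentages must sum to 100"
    return {cat: total_usd * pct / 100 for cat, pct in BUDGET.items()}
```

For an illustrative $500,000 grant, `allocate(500_000)` puts $200,000 into AI development and $50,000 into project management and overhead, including the contingency buffer.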