Conversation

ziadhany (Collaborator) commented Jan 8, 2026

ziadhany (Collaborator, Author) commented Jan 8, 2026

INFO 2026-01-26 19:15:30.575619 UTC Pipeline [AlpineLinuxImporterPipeline] starting
INFO 2026-01-26 19:15:30.575748 UTC Step [collect_and_store_advisories] starting
Importing data using alpine_linux_importer_v2
INFO 2026-01-26 22:39:08.084020 UTC Successfully collected 108,252 advisories
INFO 2026-01-26 22:39:08.084139 UTC Step [collect_and_store_advisories] completed in 12218 seconds (3.4 hours)
INFO 2026-01-26 22:39:08.084171 UTC Pipeline completed in 12218 seconds (3.4 hours)

from vulnerabilities.models import AdvisoryV2
from django.db.models import Count

# Any AdvisoryV2 rows that share the same avid would show up here:
duplicates = (
    AdvisoryV2.objects
    .values('avid')
    .annotate(count=Count('id'))
    .filter(count__gt=1)
)
len(duplicates)
Out[2]: 0

AdvisoryV2.objects.count()
Out[3]: 108252

ziadhany (Collaborator, Author) commented Jan 15, 2026

@TG1999 @pombredanne I have a question about the Alpine migration. We currently fetch one URL at a time and process its data without grouping by CVE.

The problem is that each URL reports package versions along with their fixed CVEs. How can we derive a unique identifier for this importer? Is it a good idea to restructure the data into a large mapping, using the CVE as the unique identifier?

Proposed structure:
CVE: [purl_1, purl_2, ...]

Example:
Package: aom

Sources:
https://secdb.alpinelinux.org/v3.22/main.json -> CVEs: "CVE-2021-30473", "CVE-2021-30474", "CVE-2021-30475"
https://secdb.alpinelinux.org/v3.21/main.json -> CVEs: "CVE-2021-30473", "CVE-2021-30474", "CVE-2021-30475"
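
A minimal sketch of that restructuring, assuming the secdb JSON layout shown later in this thread; the group_by_cve helper and the exact purl shape are illustrative assumptions, not the final design:

from collections import defaultdict

def group_by_cve(secdb_data, distroversion):
    """Map each CVE id to the purls of the package versions that fix it."""
    cve_to_purls = defaultdict(list)
    for package in secdb_data.get("packages", []):
        pkg = package.get("pkg", {})
        name = pkg.get("name")
        for fixed_version, cves in pkg.get("secfixes", {}).items():
            for cve in cves:
                cve_to_purls[cve].append(
                    f"pkg:apk/alpine/{name}@{fixed_version}?distroversion={distroversion}"
                )
    return cve_to_purls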

)

for cve in aliases:
    advisory_id = f"{pkg_infos['name']}/{qualifiers['distroversion']}/{cve}"
ziadhany (Collaborator, Author) commented Jan 26, 2026

Example:

apache2/v3.20/2.4.26-r0/CVE-2017-7668
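
A tiny sketch of how that id appears to be assembled; the fixed-version component is inferred from the sample id above, and the variable names are hypothetical:

# Hypothetical reconstruction of the advisory id implied by the example;
# including the fixed version is an assumption based on the sample id.
advisory_id = f"{name}/{distroversion}/{fixed_version}/{cve}"
# e.g. "apache2/v3.20/2.4.26-r0/CVE-2017-7668"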

ziadhany (Collaborator, Author) commented Jan 28, 2026

The debug-mode logs are attached: alpine.zip

ziadhany requested a review from keshav-space on January 28, 2026 at 13:50
keshav-space (Member) left a comment

Thanks @ziadhany, see comments below.

Comment on lines 74 to 121
import logging
from typing import Iterable, List
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def fetch_advisory_directory_links(
    page_response_content: str,
    base_url: str,
    logger: callable = None,
) -> List[str]:
    """
    Return a list of advisory directory links present in `page_response_content` html string
    """
    index_page = BeautifulSoup(page_response_content, features="lxml")
    alpine_versions = [
        link.text
        for link in index_page.find_all("a")
        if link.text.startswith("v") or link.text.startswith("edge")
    ]

    if not alpine_versions:
        if logger:
            logger(
                f"No versions found in {base_url!r}",
                level=logging.DEBUG,
            )
        return []

    advisory_directory_links = [urljoin(base_url, version) for version in alpine_versions]

    return advisory_directory_links


def fetch_advisory_links(
    advisory_directory_page: str,
    advisory_directory_link: str,
    logger: callable = None,
) -> Iterable[str]:
    """
    Yield json file urls present in `advisory_directory_page`
    """
    advisory_directory_page = BeautifulSoup(advisory_directory_page, features="lxml")
    anchor_tags = advisory_directory_page.find_all("a")
    if not anchor_tags:
        if logger:
            logger(
                f"No anchor tags found in {advisory_directory_link!r}",
                level=logging.DEBUG,
            )
        return iter([])
    for anchor_tag in anchor_tags:
        if anchor_tag.text.endswith("json"):
            yield urljoin(advisory_directory_link, anchor_tag.text)
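
For context, a hypothetical way to exercise these two helpers end to end; the use of requests here is an assumption for illustration only:

import requests

base_url = "https://secdb.alpinelinux.org/"
index_html = requests.get(base_url).text

# Walk each version directory (v3.22/, edge/, ...) and print every advisory JSON url.
for directory_link in fetch_advisory_directory_links(index_html, base_url):
    directory_html = requests.get(directory_link).text
    for advisory_url in fetch_advisory_links(directory_html, directory_link):
        print(advisory_url)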
keshav-space (Member) commented

@ziadhany this is a bit brittle. I've created a mirror for the Alpine secdb here: https://github.com/aboutcode-org/aboutcode-mirror-alpine-secdb. Let's use this instead.

ziadhany (Collaborator, Author) replied

Ok, I'll update the code. I didn't notice we have a mirror.

        return (cls.collect_and_store_advisories,)

    def advisories_count(self) -> int:
        return 0
keshav-space (Member) commented

Let's return the count based on the packages key.
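
A literal reading of this suggestion, assuming data is one parsed secdb JSON file:

# Hypothetical: count one advisory per entry under the "packages" key.
advisories_count = len(data.get("packages", []))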

ziadhany (Collaborator, Author) replied

Are you sure about this? The problem is that we create an AdvisoryData entry for every CVE, and a single fixed version can list several unrelated CVEs, for example CVE-2019-3828 and CVE-2020-1733:

https://nvd.nist.gov/vuln/detail/CVE-2019-3828
https://nvd.nist.gov/vuln/detail/CVE-2020-1733

  "packages": [
    {
      "pkg": {
        "name": "ansible",
        "secfixes": {
          "2.6.3-r0": [
            "CVE-2018-10875"
          ],
          "2.7.9-r0": [
            "CVE-2018-16876"
          ],
          "2.8.11-r0": [
            "CVE-2019-3828",
            "CVE-2020-1733",
            "CVE-2020-1740"
          ],

Getting the correct count means we would have to loop over every CVE listed for every package, as in the sketch below.
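
A rough sketch of that per-CVE count, assuming data is one parsed secdb JSON file:

# Count every CVE listed under every fixed version of every package.
count = sum(
    len(cves)
    for package in data.get("packages", [])
    for cves in package.get("pkg", {}).get("secfixes", {}).values()
)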

keshav-space (Member) commented

@ziadhany since we already have all the advisory files locally, we can instead return the count of CVEs from these files.
Perhaps we can return something like this?

sum(len(re.findall(r'\bCVE-\d{4}-\d+\b', a.read_text())) for a in secdb.rglob("*.json"))
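
A self-contained version of that one-liner; the local secdb checkout path is a placeholder assumption:

import re
from pathlib import Path

# Hypothetical path to the locally mirrored secdb files.
secdb = Path("aboutcode-mirror-alpine-secdb")

# Count every CVE id occurrence across all advisory JSON files.
count = sum(
    len(re.findall(r"\bCVE-\d{4}-\d+\b", advisory_file.read_text()))
    for advisory_file in secdb.rglob("*.json")
)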

Commits:

…aseImporterPipelineV2 (Signed-off-by: ziad hany <ziadhany2016@gmail.com>)
Fix duplication on advisory_id (Signed-off-by: ziad hany <ziadhany2016@gmail.com>)